# **Chapter 4**

# **Time-Domain Motion Detection Architecture**

## 4.1 Introduction

Visual motion processing plays an essential role in applications such as automotive guidance, robotics control, and security surveillance. Since motion-related algorithms are generally computationally very expensive, software solutions cannot attain real-time response capability. Therefore, hardware solutions employing smart vision chip architectures are being pursued very extensively [16, 22, 27, 52].

In smart vision sensors, photodiodes are integrated with an array of SIMD (single instruction multiple data) processing elements, and real-time performance is achieved by massively parallel processing on the chip. In most cases, analog circuitry is used to build a processing element because it allows compact and power-efficient implementation [16,22,52]. However, it is difficult to change functions in hardwired analog circuits, while digital processing is very flexible in implementing algorithms. Therefore, where to place the boundary between the analog processing and the digital processing is an important issue in the design of vision processing systems.

The purpose of this chapter is to present an analog VLSI motion sensor chip in which the most time-consuming part, the computation of spatial and temporal derivatives at each pixel, is carried out by analog processing, while all other processing is digital. The time-domain technique is employed in the analog processing [53]. Namely, the pixel intensities are represented by pulse widths, the derivatives are calculated simply by XOR gates, and the results are converted to digital values using binary counters. In addition, thanks to the four-way shared-photodiode architecture developed in this work, seamless operation of photo integration, spatial derivative computation, and temporal derivative computation has been established. A proof-of-concept chip containing a  $31 \times 31$  detector array was fabricated in a 0.35- $\mu$m CMOS technology, and operation of the chip at a rate of 400 frames/sec was successfully demonstrated under a 3.3 V power supply.

## 4.2 System Organization

### 4.2.1 Overall Architecture

Fig. 4.1 shows a simplified block diagram of the proposed motion sensor chip architecture. The chip contains an array of gradient detectors, counters for time-to-digital conversion, and row-selector and column-selector circuitry for controlling the array. The gradient detector array is controlled by the *PE Control Signals* and the *Common Ramp* signal. Each gradient detector calculates the temporal and spatial gradients at its location using the time-domain technique. The time-domain signals (pulse-width modulated signals) thus obtained are converted to digital values by counters located at the bottom of the array, and the digital values are provided as the output of the chip.

### 4.2.2 Unit Structure of Array

Fig. 4.2 illustrates the detailed diagram of a unit structure of the gradient detector array, which consists of a gradient detector, photodiodes (PD's), and PD readout control circuits (C1  $\sim$  C4). Both PD's and gradient detectors are laid out in the same square lattice arrangement, where the two lattices are shifted from each other by a half pixel pitch in both x and y directions. Each PD is shared by four readout control circuits. Each readout circuit has two ports, A and B, which transfer the PD signal to two neighboring gradient detectors.

Figure 4.1: System organization of proposed architecture.

The detailed schematics of the PD and its readout control circuits are illustrated in Fig. 4.3. The readout control circuit is composed of the shared photodiode mentioned above, a reset transistor (M1), a switching transistor that controls the exposure time (M2), a sampling capacitor, a source follower to read out the sampled voltage, and a series of transistors that control the pixel data transfer to output ports A and B. The signals  $MODE\_T$  and  $MODE\_S$  are common to all four circuits C1  $\sim$  C4, while each destination-select signal  $SEL\_n$  is connected to only two circuits, in such a way that  $SEL\_2$  is connected to the readout circuits C2 and C1,  $SEL\_3$  to C3 and C2, and so on.

Figure 4.2: Detailed block diagram of the gradient detector array.

The transistors M1 and M2 are activated independently in each circuit, thus allowing the same photodiode to be sampled at different timings. Then, by controlling the series of transistors, the following two functions are realized by a gradient detector.

#### Temporal Gradient Mode

In order to compute a temporal gradient,  $MODE\_T$  and one of the  $SEL\_n$  signals are activated. For example, when  $MODE\_T$  and  $SEL\_2$  are activated, circuit C1 transfers its sampled voltage from port A1, and circuit C2 transfers its sampled voltage from port B2. As a result, the voltages sampled from the same photodiode at different timings are delivered to the detector (see Fig. 4.2).

#### Spatial Gradient Mode

Figure 4.3: Schematic of photodiode and readout control circuits.

A spatial gradient is evaluated when  $MODE\_S$  and one of the  $SEL\_n$  signals are activated. In this case, only one of the readout circuits operates. For example, when  $MODE\_S$  and  $SEL\_1$  are activated, the sampled voltage in circuit C1 is transmitted through ports B1 and A1 to the detectors located above and below C1, respectively.

### 4.2.3 Time-domain Gradient Detector

Fig. 4.4 illustrates the basic configuration of the time-domain gradient detector. The gradient detector consists of two differential amplifiers (Diff. Amps. A and B), a flag comparator, and three tri-state inverters for output control, as shown in Fig. 4.4(a). The output ports of the surrounding PD readout control circuits are connected to the positive terminal of either differential amplifier A or B, in such a way that the four A ports of circuits  $C1 \sim C4$  (denoted  $A1 \sim A4$  in Fig. 4.2) are connected to amplifier A, and the four B ports are connected to amplifier B. As described in the previous section, only one of the four ports is activated at a time, and the analog value transferred from the activated port is converted to a time-domain signal in the following way.



Figure 4.4: Schematic of time-domain gradient detector. (a) Gradient detector and its operation. (b) Flag comparator.

The common analog ramp signal is fed to the negative nodes of all amplifiers. Each amplifier output ("Flag" in the figure) is high ("1") at the beginning of operation, then changes to low ("0") when the ramp exceeds the analog voltage applied to the positive node. Namely, the magnitude of each amplifier input voltage is converted to the timing of a signal transition, as shown in the figure. These two time-domain signals are then processed in a flag comparator.

The flag comparator depicted in Fig. 4.4(b) is an XOR circuit designed specifically for our purpose, based on pass-transistor logic. The XOR function generates a pulse whose width represents the absolute value of the timing difference between the two flag signals, as shown in Fig. 4.4(a). Besides the XOR output, plus and minus sign signals indicating which flag transitions first are generated. The gradient is obtained by combining these signals.

The time-domain gradient detection is conducted row-parallel. Namely, the row selector activates one of the rows, and time-domain gradient detection is carried out in parallel within that row (see Fig. 4.1). The time-domain signals generated in the detectors are transferred to counters located at the bottom of the array, where time-to-digital conversion is accomplished. The counters used in this architecture are designed as up/down counters. The XOR flag pulse generated by the comparator enables the counter, and one of the sign flags is used as the up/down control. In this way, on-chip conversion of the gradient to a signed digital value is realized.
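The conversion chain described above (ramp comparison, XOR of the two flags, and up/down counting) can be sketched behaviorally as follows. This is a minimal model and not the circuit itself: the integer millivolt units, the ramp step of 10 mV per clock, the 256-clock conversion window, the function names, and the sign convention (count up while flag A outlasts flag B) are all assumptions made for illustration.

```python
def flag_fall_time(voltage_mv, ramp_step_mv, n_clocks):
    """Clock index at which the rising ramp first exceeds the input voltage.

    The amplifier output ("Flag") is 1 before this index and 0 from it onward.
    """
    for t in range(n_clocks):
        if t * ramp_step_mv > voltage_mv:
            return t
    return n_clocks  # the ramp never exceeded the input within the window


def time_domain_gradient(v_a_mv, v_b_mv, ramp_step_mv=10, n_clocks=256):
    """Signed count proportional to (v_a - v_b), via XOR and an up/down counter."""
    t_a = flag_fall_time(v_a_mv, ramp_step_mv, n_clocks)
    t_b = flag_fall_time(v_b_mv, ramp_step_mv, n_clocks)
    count = 0
    for t in range(n_clocks):
        flag_a = 1 if t < t_a else 0
        flag_b = 1 if t < t_b else 0
        if flag_a ^ flag_b:                # XOR pulse is high: counter enabled
            count += 1 if flag_a else -1   # sign flag selects up or down (assumed convention)
    return count
```

In this model the magnitude of the count equals the XOR pulse width in clock cycles, and swapping the two inputs only flips its sign.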

The ratio of the temporal gradient to the spatial gradient gives a "normal optical flow," which is the apparent velocity of movement in the obtained image [54]. However, a small spatial gradient causes the optical flow to diverge. Therefore, spatial gradients below a certain threshold must be ignored in the calculation of optical flow. Digital processing provides flexibility in such operations, so the procedures following gradient detection are performed by an external digital processor in this architecture.
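The off-chip step described above (division of gradients with rejection of small spatial gradients) can be illustrated with a short sketch. The function name, the use of `None` for rejected pixels, and the plain ratio without any sign convention are assumptions for illustration; the source does not specify how the external processor represents rejected pixels.

```python
def normal_flow(temporal, spatial, threshold):
    """Per-pixel ratio of temporal to spatial gradient.

    Pixels whose spatial gradient magnitude is below `threshold` are
    returned as None so that diverging flow values are discarded.
    """
    flows = []
    for g_t, g_s in zip(temporal, spatial):
        if abs(g_s) < threshold:
            flows.append(None)       # spatial gradient too small: ignore
        else:
            flows.append(g_t / g_s)  # ratio of temporal to spatial gradient
    return flows
```

For example, signed counter outputs for one row can be passed in as two lists of integers, one for each gradient.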

### 4.2.4 Seamless Operation

Fig. 4.5 shows an example of the timing chart, which describes the operations conducted in the readout control circuits  $C1 \sim C4$ , and in the gradient detector.

Although the gradient evaluation in each readout control circuit can be conducted only after the PD has been reset and exposed, the four readout circuits sharing the PD can operate in parallel. Therefore, continuous evaluation is achieved in the gradient detector.

Figure 4.5: Timing-chart example of seamless operation.

Regarding the entire system, gradient detection must be conducted row by row as mentioned above, so sequential scanning of the rows is required to obtain the gradients of the entire array. However, this is not critical in terms of processing time, since the total scanning time is at most a few hundred microseconds, while the typical photo integration time is on the order of milliseconds. Namely, the evaluation of temporal and spatial gradients required for optical flow calculation is realized with two readout circuits without waiting for the PD to be reset and exposed, as shown in the figure. In this manner, high frame rates can be achieved.

## 4.3 Experimental Results

### 4.3.1 Prototype Chip

In order to verify the proposed architecture, a proof-of-concept chip was designed and fabricated in a standard  $0.35\mu m$  CMOS technology. The chip includes  $31\times31$  gradient detectors and  $32\times32$  PD's. The photomicrograph of the chip is shown in Fig. 4.6(a) and the specifications are summarized in Table 4.1. Fig. 4.6(b) shows the pattern layout of the gradient detectors and their peripheral circuits (the dashed line in the figure encloses the unit circuitry repeated throughout the array). Since the top metal layer was used to cover the array to prevent exposure to light, only two metal layers were available for wiring between circuits. As a result, 60% of the area is occupied by metal lines. Therefore, by using a process technology for digital circuitry that offers more metal layers, the unit size could be made much more compact.

Figure 4.6: (a) The photomicrograph of proof-of-concept chip. (b) The pattern layout of the unit of gradient detector array.

Table 4.1: Specification summary of proof-of-concept chip

| Item                  | Specification                      |
|-----------------------|------------------------------------|
| Process Technology    | 0.35 µm CMOS, 3-Metal              |
| Core Size             | $2.9 \times 2.9 \text{ mm}^2$      |
| Supply Voltage        | 3.3 V                              |
| Operating Frequency   | 20 MHz                             |
| Maximum Frame Rate    | 400 frames per sec.                |
| Power Consumption     | $\sim$ 25 mW (w/o Ramp Generation) |
| Array Size            | 31×31                              |
| Size of 1 Unit        | $80\times80~\mu\mathrm{m}^2$       |
| Number of Transistors | 40 Tr. (Gradient Detector)         |
|                       | 40 Tr. (PD and Readout Circuits)   |
| Fill Factor           | 4.5 %                              |

Figure 4.7: Measured waveforms of prototype chip in temporal gradient mode. (a) Experimental setup, which emulates a step change in illuminance. Waveforms under (b) dim illuminance and (c) bright illuminance.

### 4.3.2 Measured Waveforms

Fig. 4.7 displays the measured waveforms of the prototype chip executing the "temporal gradient mode." The measurement setup is illustrated in Fig. 4.7(a). The waveforms were obtained from a unit circuit located at the periphery of the array. The results are plotted in Fig. 4.7(b) and (c).

Although the illumination was kept constant during the operation, the readout control circuits were controlled as follows to emulate a temporal gradient measurement. The switching transistor M2 in circuit C1 was always closed, so the sampled voltage in C1 was held at the reset voltage. On the other hand, the transistor M2 in circuit C2 was opened for 2.5 ms, so only the voltage in C2 was changed by the illumination. The gradient detector evaluated the temporal gradient between circuits C1 and C2. As a result, the experiment emulated the situation in which the illuminance changes as a step function, as shown in Fig. 4.7(a).

The common ramp signal and the XOR flag generated by the flag comparator are plotted in Fig. 4.7(b) and (c), where the pulse width of the XOR flag is seen to depend on the illumination intensity. The edges of the XOR waveform are blurred because the brightness of the external lamp fluctuated and the waveforms were obtained by the oscilloscope as an average over repeated sampling.

### 4.3.3 Optical Flow Detection

Fig. 4.8 shows the experimental results of normal optical flow detection. The object moved diagonally toward the upper left in front of the chip. The obtained images of the temporal gradient, the horizontal and vertical spatial gradients, and the sequence of normal optical flow are plotted.

The pixel data of the gradient images in Fig. 4.8(a)~(c) are normalized so that gray pixels represent zero, dark pixels represent negative values, and bright pixels represent positive values. Some isolated black and white pixels are presumably due to unstable operation of the digital counters, and improvements are now under study. The optical flow sequence shown in Fig. 4.8(d)~(f) was obtained by digital processing. In addition to dividing the temporal gradient values by the spatial gradient values, the processor performed threshold detection to eliminate diverging flows. The time interval between images is 75 ms, and the flows indicate the upper-left diagonal movement.



Figure 4.8: Experimental results of normal optical flow detection for an object moving diagonally toward the upper left. (a) Temporal gradient (b) Spatial gradient in X-direction (c) Spatial gradient in Y-direction (d)  $\sim$  (f) The sequence of normal optical flow calculated by an off-chip processor.

The maximum frame rate was 400 frames/sec, measured at an operating frequency of 20 MHz with a 3.3 V power supply. The maximum power consumption was 25 mW, excluding ramp signal generation.

## 4.4 Summary

In this chapter, we have presented an architecture to realize high-speed gradient detection for analog motion sensor chips. A time-domain gradient detector generates pulse-width signals representing both temporal and spatial gradients, and these signals are converted to signed digital values by on-chip processing. Furthermore, a seamless capturing operation is realized by the shared photodiode configuration. Experimental results obtained from the test chip fabricated in 0.35  $\mu$m CMOS technology have been demonstrated, and a frame rate of 400 frames/sec was successfully achieved.

# Chapter 5

# Mixed-Signal Focal-Plane Image Processor

## 5.1 Introduction

Focal-plane image processing plays an essential role in real-time vision systems for image recognition, object tracking, surveillance, security, and so forth. Since the processing involves various spatiotemporal convolutions at each pixel location, the computation is very expensive, making it very difficult to achieve real-time response using only software running on general-purpose processors. Therefore, hardware solutions employing dedicated architectures are essential for real-time vision systems. Among those, the *smart image sensor* architecture is being pursued very extensively because of its capability for massively parallel operation.

In the *smart image sensor* architecture, photodetectors are integrated with an array of SIMD (single instruction multiple data) processing elements, and real-time performance is achieved by massively parallel processing on the chip. Since analog circuitry makes it possible to implement much simpler and more power-efficient configurations, the integration of photodetecting elements and analog processing circuits on the same chip has been discussed since the late 1980's [16]. This configuration is often referred to as a "silicon retina," and has inspired many engineers to develop retinomorphic vision systems that also imitate these parallel processing capabilities [18, 23–26]. However, the major drawback of analog circuitry is that it is difficult to change the functions of hardwired analog circuits. Therefore, most analog smart image sensors are optimized for specialized image processing. On the other hand, digital implementations offer great flexibility and programmability in accommodating the system to various algorithms. For example, Ishikawa et al. have proposed a digital SIMD vision chip aiming at high-speed target tracking [27–30], in which a 1b general-purpose ALU is integrated into each pixel. However, one of the drawbacks of digital implementation is the amount of circuitry. A digital processing element requires analog-to-digital conversion and one or more digital components such as registers and memories. Therefore, the circuit scale becomes relatively large, which increases power consumption. As a result, where to place the boundary between analog processing and digital processing is an important issue in the design of image processing systems.

A time-domain technique has been proposed as an analog-digital merged circuit architecture in which pulse-width modulation (PWM) signals are utilized for processing [53, 55], and the concept has been introduced to a smart image sensor that realizes spatial image processing [56]. In these works, a PWM signal represents analog information in its pulse width while having a binary amplitude. Analog circuits process the information based on a switched-current integration technique, while digital logic gates operate on the continuous pulse widths simply as the signals pass through them. As a result, the signals have both analog and digital features, so their processing circuits can enjoy the architectural advantages of both techniques.

One of the challenges in time-domain signal processing is where to implement storage elements for the processing results. PWM signals cannot be stored as they are, so storage requires domain conversion from time to analog or digital, and the storage devices occupy a relatively larger area than the processing devices. In the conventional time-domain smart image sensor [56], the processing elements are connected in series and the storage device is placed at an edge of the processing chain. Namely, pixel-level processing is carried out continuously, and domain conversion is conducted as a global-level operation. However, it is difficult to carry out temporal image processing in such a configuration because it often requires pixel information at different timings. As a result, it is essential to implement storage elements even at the pixel level for temporal image processing, which is necessary to realize various motion-related algorithms.

In this chapter, we present a mixed-signal focal-plane image processor capable of performing spatial and temporal convolutions adaptable to various kernels. Both the compactness of analog and the programmability of digital have been achieved by employing the time-domain computation technique in pixel-level processing. Namely, pixel information is represented as a pulse width as in [53], and all computations are carried out by simple digital logic and a binary counter based on a time-domain arithmetic operation proposed in this work. In addition, operation results are stored in digital format in every pixel, allowing any temporal computation without degradation of the pixel data. The concept has been verified by proof-of-concept chips fabricated in 0.18-$\mu$m CMOS technology.

## 5.2 System Organization

### 5.2.1 Overall Architecture

Fig. 5.1 shows a block diagram of the proposed focal-plane image processor. The processor contains a pixel array and peripheral circuitry for controlling the array: a row selector, line registers, and column-selector circuitry. The pixel array is composed of processing elements (PE's) and photodiode (PD) units. Both PE's and PD units are laid out in identical lattices, but the two lattices are shifted from each other by a half pixel pitch in both x and y directions [33]. Thanks to this arrangement, each PE can receive data either from four PD units or from four PE's in its neighborhood, so the data necessary for processing can be gathered in a minimal number of steps without crossover interconnections.



Figure 5.1: Array configuration with only-nearest-neighbor interconnection.

Each PD unit is composed of a photodiode (PD) and analog-to-time converters (ATC's), in which photo-current integration and conversion of the analog voltage to pulse-width modulation (PWM) signals are conducted. The PE carries out convolution operations on the PWM signals received from neighboring PD units or PE's, and stores the result locally. The PE also generates a pulse signal proportional to the stored value and transmits the pulse to neighboring PE's. In this manner, every data transmission among PE's and PD units is carried out on a single interconnect by a pulse bearing information in its width.

Figure 5.2: Processing for kernel sizes larger than  $2\times 2$ . (a) Pixel data acquisition by PE-a's; (b) data transmission from PE-a's to PE-b's.

For kernel sizes up to 2×2, convolution is realized by a single PE at each pixel site. For kernel sizes larger than 2×2, inter-PE data transmission is required. Fig. 5.2 shows the processing for kernel sizes larger than 2×2. All PE's are divided into two groups (PE-a and PE-b) in a checkerboard pattern. At first, PE-a's gather the data from the four neighboring PD units, as shown in Fig. 5.2(a). Then, each PE-a generates a pulse as its processing result, and PE-b's carry out the convolution on these pulses (Fig. 5.2(b)). After this, the roles of PE-a and PE-b are exchanged. By repeating such processes, convolutions with kernels of any size can be accomplished.
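The single-PE case (kernels up to 2×2) can be sketched as follows, using the half-pitch-shifted lattices so that an array of N×N PD units feeds (N−1)×(N−1) PE's. This is only a behavioral illustration: the function name, the nested-list image representation, and the convention that the kernel is applied without flipping are assumptions, and the inter-PE steps for larger kernels are not modeled.

```python
def pe_convolve_2x2(pd, kernel):
    """Weighted sum of the four PD values surrounding each PE.

    pd     : list of rows of pixel values (PD lattice)
    kernel : 2x2 list of weights, applied without flipping (assumed convention)
    Returns the PE lattice, one row and one column smaller than `pd`.
    """
    rows, cols = len(pd), len(pd[0])
    out = []
    for i in range(rows - 1):
        out_row = []
        for j in range(cols - 1):
            acc = 0
            for a in range(2):
                for b in range(2):
                    acc += kernel[a][b] * pd[i + a][j + b]
            out_row.append(acc)
        out.append(out_row)
    return out
```

With a 32×32 input this returns a 31×31 result, matching the PD-to-PE array sizes quoted for the prototype chips.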

### 5.2.2 Principle of Time-Domain Arithmetic Operation

Figure 5.3: Time-domain arithmetic operation.

The principle of the time-domain arithmetic operation is shown in Fig. 5.3. The PE's carry out convolution on PWM signals according to this principle. First of all, we define an operation slot consisting of a particular number of clock cycles, which determines the resolution of the operation. For example, the operation slot is set to  $256 \ (=2^8)$  clock cycles in the figure. For the convolution operation, a PWM signal (Pulse) is converted to a digital number by counting the clock cycles during which the pulse is "1". If signed numbers are required in the operation, the maximum width of Pulse is limited to less than half of the clock cycles in the slot. In this case, the inverted PWM signal ( $\overline{Pulse}$ ) represents the negative of the original PWM signal. This is because the width of  $\overline{Pulse}$  is equal to  $256 - X$  (where  $X$  is the width of Pulse), and the 256 vanishes due to overflow of the operation slot. In addition, multiplication is realized by taking the logical AND of Pulse and a weight signal Weight, as shown at the bottom of the figure. In this manner, basic arithmetic operations in the digital domain are accomplished by simple binary logic functions.
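The principle above can be checked with a small behavioral sketch. All names here (`pwm`, `invert`, `and_gate`, `count_ones`, `to_signed`) are hypothetical helpers, and the weight signal is modeled as a simple alternating pattern with a 1/2 duty cycle, which is an assumption; the source does not specify the shape of the Weight signal.

```python
SLOT = 256  # clock cycles per operation slot (8b resolution)


def pwm(width, slot=SLOT):
    """PWM signal that is 1 for the first `width` clocks of the slot."""
    return [1 if t < width else 0 for t in range(slot)]


def invert(pulse):
    """Bit-inverted pulse; its width is SLOT - X for an input of width X."""
    return [1 - bit for bit in pulse]


def and_gate(pulse, weight):
    """Logical AND of a pulse and a weight signal (time-domain multiplication)."""
    return [p & w for p, w in zip(pulse, weight)]


def count_ones(pulses, slot=SLOT):
    """Count the 1-cycles of all pulses in a counter that wraps at `slot`."""
    total = 0
    for pulse in pulses:
        total = (total + sum(pulse)) % slot
    return total


def to_signed(value, slot=SLOT):
    """Interpret a wrapped count as a two's-complement signed number."""
    return value - slot if value >= slot // 2 else value
```

Counting `pwm(100)` together with `invert(pwm(30))` wraps to 70, i.e. 100 − 30, while ANDing `pwm(100)` with a half-duty weight leaves 50 counted cycles.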

## 5.3 Building Block Circuits

### 5.3.1 Photodiode Unit and Analog-to-Time Converters

Figure 5.4 illustrates the schematics and the operation timing chart of a PD unit containing a photodiode (PD) and analog-to-time converters (ATC's). The ATC is composed of a reset transistor, a CMOS switch for electronic shutter, and an analog comparator. Two ATC's share one common PD and they generate pulse signals according to the pixel intensity. Each ATC is controlled independently, thus allowing data sampling from the same photodiode at different time stamps. In addition, one ATC can be utilized for processing while another ATC carries out photo-current integration.

The conversion from analog voltage to PWM signal is shown in Fig. 5.4(b). After the photo-current integration, the sampled voltage is applied to the negative node of the comparator and compared with the common analog ramp-down signal fed to the positive nodes of all comparators. At the start of the ramp signal, the output of the comparator is "1"; it then changes to "0" when the ramp goes below the sampled voltage. Namely, the magnitude of the pixel intensity is converted to the width of a pulse signal, thus accomplishing analog-to-time conversion. The ramp-down period from beginning to end is controlled according to the operation slot of the processing. For example, the ramp-down period is set to 128 clock cycles in the example shown in Fig. 5.3.
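A minimal behavioral sketch of this analog-to-time conversion is given below. The integer millivolt units, the ramp start voltage of 1280 mV, and the 10 mV step per clock are illustrative assumptions; only the 128-cycle ramp-down period comes from the example in the text.

```python
def atc_pulse_width(sample_mv, ramp_start_mv=1280, ramp_step_mv=10, ramp_cycles=128):
    """Number of clock cycles for which the comparator output is "1".

    The output is "1" while the common ramp-down signal is above the
    sampled voltage and "0" once the ramp has fallen below it.
    """
    width = 0
    for t in range(ramp_cycles):
        ramp_mv = ramp_start_mv - ramp_step_mv * t  # common ramp-down signal
        if ramp_mv > sample_mv:                     # comparator output is "1"
            width += 1
    return width
```

In this model a lower sampled voltage yields a longer pulse, up to the full ramp-down period.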

### 5.3.2 Processing Element

The processing element (PE) is depicted in Fig. 5.5. It consists of three major components: four input controllers, a pulse width counter, and a pulse generator.

Figure 5.4: Photodiode units including PD and analog-to-time converter (ATC): (a) schematics; (b) operation timing chart.

Figure 5.5: Processing Element

Figure 5.6: Masking Unit: (a) schematic; (b) operation table

Figure 5.7: Variable-Stepping 11b Binary Counter

In each input controller, the selector decides whether to receive the PWM signal from a neighboring PE or from a neighboring PD unit. Then, the PWM signal is manipulated by a mask unit according to a 2b mask code, thus realizing the operation principles explained in Section 5.2.2. The schematic and the operation table of the mask unit are shown in Fig. 5.6. The PWM signal is passed through as it is when the 2b mask code is "01". On the other hand, the PWM signal is bit-inverted when the mask code is "10", which is regarded as the negative of the pulse width. In addition to these functions, the pulse signal is always forced to "0" as a NOP when the mask code is "00", while it is always "1" when the mask code is "11". In this manner, the four input controllers define a  $2\times2$  kernel with  $\pm$  weight values assigned as mask codes.
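The operation table of the mask unit described above can be written out as a small function. The function names are hypothetical, and the sketch models only the logic of the four mask codes, not the circuit of Fig. 5.6.

```python
def mask_unit(pulse_bit, mask_code):
    """Output bit of the mask unit for one clock cycle.

    mask_code 0b00: always 0 (NOP)
    mask_code 0b01: pass the PWM bit through
    mask_code 0b10: bit-invert the PWM bit (negative weight)
    mask_code 0b11: always 1
    """
    if mask_code == 0b00:
        return 0
    if mask_code == 0b01:
        return pulse_bit
    if mask_code == 0b10:
        return 1 - pulse_bit
    if mask_code == 0b11:
        return 1
    raise ValueError("mask_code must be a 2b value")


def apply_mask(pulse, mask_code):
    """Apply the mask unit to every clock cycle of a PWM signal."""
    return [mask_unit(bit, mask_code) for bit in pulse]
```

Assigning one mask code to each of the four input controllers then selects the sign (or absence) of each of the four kernel weights.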

The manipulated PWM signals are accumulated in the pulse width counter, which consists of a 4-input 1b adder circuit and a variable-step 11b counter. The number of "1"s in the four PWM signals from the input controllers is counted by the 4-input adder. Then, the 3b result is provided as the count-up step to the variable-step binary counter. Fig. 5.7 shows the details of the variable-step 11b binary counter. The first stage of the counter is composed of a 3b full adder and a 3b register. In the first stage, the 3b result from the 4-input adder (the number of 1's in the input pulses) is added to the value in the register. In addition, the signal *Dataout* is applied as *Carry In* to the lowest bit. Then, the carry-out signal of the 3b full adder is transferred to the second stage. The second stage is an 8b binary counter composed of a series of eight bit-slice circuits. Each bit-slice circuit consists of a toggle flip-flop and a logical AND circuit, generating the sum and the carry of the corresponding bit. The value in the 3b register and the sums of the 8b binary counter are concatenated as an 11b accumulated value *Result*, which is read out as the processed result after the operation. In order to control the resolution of the operation slot, one of the carry signals from the upper four bit-slice circuits can be selected as a pulse generation trigger, which is transferred to the pulse generator. A flip-flop inserted after the selector eliminates signal hazards during the state transition.



Figure 5.8: Operation timing chart.

An example of the operation chart of the PE is shown in Fig. 5.8. The left half of the figure shows the accumulation of four PWM signals. When all four inputs are "1", the 11b register value (*Result*) increases by 4 at each clock. If three of the inputs are "1", *Result* increases by 3, and so forth. Consequently, *Result* yields the sum of the pulse widths of all four inputs in digital format.
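The accumulation behavior can be sketched with a two-stage model that mirrors the description of the variable-step counter (a 3b adder and register followed by an 8b counter). This is an illustrative model only: the helper names are hypothetical, *Dataout* (Carry In) is assumed to be "0" during accumulation, and the resolution selector and hazard-removal flip-flop are not modeled.

```python
def pwm(width, slot=256):
    """PWM signal that is 1 for the first `width` clocks of the slot."""
    return [1 if t < width else 0 for t in range(slot)]


def accumulate(pulses):
    """Accumulate four PWM inputs into an 11b Result, one clock at a time."""
    low3 = 0   # 3b register of the first stage
    high8 = 0  # 8b binary counter of the second stage
    for inputs in zip(*pulses):
        step = sum(inputs)       # 4-input 1b adder: number of "1" inputs (0..4)
        total = low3 + step      # 3b full adder (Carry In assumed 0 here)
        low3 = total & 0b111     # lower three bits stay in the register
        carry = total >> 3       # carry out goes to the second stage
        high8 = (high8 + carry) & 0xFF
    return (high8 << 3) | low3   # concatenated 11b Result
```

After one full slot, the returned value equals the sum of the four input pulse widths.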

A domino-logic inverter is used as the pulse generator to convert *Result* to a pulse-width signal. When *Dataout* is "0", the output of the inverter is always "0". Pulse generation begins when *Dataout* is set to "1" and all inputs of the 4-input adder are "0". In this setting, the value in the variable-step counter is counted up one by one (see Fig. 5.7). During count-up, *Carry* from the variable-step counter remains "0" until *Result* reaches 256 when *Resolution* is set to 8b. Only at the transition of *Result* from 256 to 0 does *Carry* become "1", and the output of the domino inverter flips. After that, the output of the domino inverter remains "1" until the operation stops. As a result, a pulse signal whose width is proportional to the initial value of *Result* is generated. In addition, *Result* returns to its initial value after 256 clocks, thus allowing the data to be reused repeatedly.
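The pulse generation mechanism can be illustrated with the following sketch. It models the counter as a simple modulo-256 counter that is incremented once per clock, which is a simplifying assumption about the 8b-resolution setting; the function name and the exact clock at which the output flips are likewise assumptions.

```python
def generate_pulse(result, slot=256):
    """Count up from `result` for one slot and emit the generated pulse.

    Returns (trace, final_result), where `trace` holds the pulse generator
    output at every clock and `final_result` is the counter value afterwards.
    """
    out = 0
    trace = []
    for _ in range(slot):
        trace.append(out)    # output observed during this clock cycle
        result += 1          # Dataout = 1 acts as Carry In: count up by one
        if result == slot:   # counter overflows: Carry becomes "1"
            result = 0
            out = 1          # domino inverter output flips and stays "1"
    return trace, result
```

In this model the number of "1" cycles in the trace equals the initial counter value, and the counter holds its initial value again once the slot has elapsed.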

## 5.4 Experimental Results and Discussions

### 5.4.1 Design of 31×31 Focal-Plane Image Processor

#### **Chip Fabrication**

To start with a feasibility study, a first version of a prototype chip containing  $31 \times 31$  PE's and  $32 \times 32$  PD units has been designed and fabricated in 0.18- $\mu$m 1P5M CMOS technology. Since the purpose of this development is to verify the design feasibility of the array arrangement, the time-domain arithmetic operation explained in Section 5.2.2 has not been implemented. Therefore, the convolutions of positive and negative numbers are implemented as separate operations. The configuration of the PE is modified as follows.

Figure 5.9: Processing element of the first version chip.

Fig. 5.9 illustrates the details of the PE in the first version chip, which consists of four input controllers, an up/down counter, and a pulse generator. In each input controller, a logical AND is taken of the received PWM signal and the mask bit in the 1b register. Namely, the four input controllers define a  $2 \times 2$  kernel with 1b weights assigned as mask bits. The major components of the up/down counter are two adders, a 4-input 1b adder (**A**) and an 8b ripple-carry adder (**B**), and an 8b register. The number of "1"s in the signals from the input controllers is counted by adder **A**. Then, each bit of the 3b output from adder **A** is XORed with the *Up/Down* signal, and the result is extended to an 8b number by filling the upper 5 bits with "00000" if Up/Down=0 (or with "11111" if Up/Down=1). Finally, the result is accumulated in the 8b register as *Result* using adder **B**. An example of the operation chart is shown in the figure. When all four inputs are "1", the 8b register value (Result) increases by 4 at each clock. If three of the inputs are "1", Result increases by 3, and so forth. Consequently, Result yields the sum of the pulse widths of all four inputs in digital format. When Up/Down=1, the output of adder **A** is bit-inverted and filled with "11111" in the upper bits. In addition, if Dataout=0, Carry In becomes "1", meaning that the number is subtracted from Result in 2's complement. Thus, convolution of four input signals with  $\pm$  signs is accomplished.
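One clock of this up/down accumulation can be sketched as follows. The function name is hypothetical, and the model assumes Dataout=0 so that Carry In equals the Up/Down signal, as described above for the subtraction case.

```python
def updown_step(result, inputs, up_down):
    """One clock of the first-version up/down counter (8b Result).

    inputs  : the four masked PWM bits for this clock
    up_down : 0 to add the number of "1" inputs, 1 to subtract it
    """
    ones = sum(inputs)                            # adder A: 0..4 as a 3b number
    low3 = ones ^ (0b111 if up_down else 0b000)   # XOR each bit with Up/Down
    upper5 = 0b11111000 if up_down else 0         # fill the upper 5 bits
    carry_in = 1 if up_down else 0                # assumed: Dataout = 0
    return (result + (upper5 | low3) + carry_in) & 0xFF  # adder B, 8b wrap
```

Bit inversion plus a carry-in of "1" forms the 2's complement of the count, so the 8b addition performs a subtraction.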

A photomicrograph of the first version chip is shown in Fig. 5.10 and the specifications are summarized in Table 5.1. The maximum operating frequency is 20MHz with a 1.0V supply, where the power dissipation is 6.7mW excluding ramp signal generation. Since a  $2 \times 2$  kernel convolution takes 256 clock cycles, the processor performs over 78,000 convolutions/s. The measured waveforms are also given in Fig. 5.11. Due to the synchronization between the oscilloscope and the external DAC for ramp signal generation in the measurement setup, the figure shows the processor operating at 1MHz with a 1.0V supply.

Figure 5.10: Photomicrograph of the first version chip.

Table 5.1: Specifications of the first version chip

| Process Technology   | 0.18 μm CMOS, 5-Metal                            |
|----------------------|--------------------------------------------------|
| Core Size            | $3.0 \text{ mm} \times 3.2 \text{ mm}$           |
| Effective Pixels     | $31 \times 31$ Processing Elements               |
|                      | $32 \times 32$ Photodiodes                       |
| Pixel Pitch          | $89.25~\mu\mathrm{m} \times 89.25~\mu\mathrm{m}$ |
| Transistor Count     | 820,000                                          |
|                      | (824 Transistors / pixel)                        |
| Fill Factor          | 3.63 %                                           |
| Supply Voltage       | 1.0 V                                            |
| Max. Clock Frequency | 20 MHz                                           |
| Power Consumption    | 6.7 mW (@ 20 MHz, 1.0 V)                         |
|                      | (w/o Ramp Generation)                            |
| Max. Operation Rate  | 78,125 convolutions / sec                        |

Figure 5.11: Measured waveforms.
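As a quick check of the quoted throughput, the operation rate follows directly from the clock frequency and the 256 clock cycles needed per convolution (one convolution at a time per PE is assumed here):

$$
\frac{20 \times 10^{6}\ \text{cycles/s}}{256\ \text{cycles/convolution}} = 78{,}125\ \text{convolutions/s}
$$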

#### **Measurement Results**

Measurement results of focal-plane image processing performed at 20MHz are shown in Fig. 5.12. For correlated double sampling (CDS), the reset voltage of each pixel is sampled and stored as a minus number in the PE. After photo-current integration, the result is accumulated onto the minus number, thus realizing digital double sampling for FPN cancellation. The result of spatial convolution for 45-degree edge enhancement is shown in Fig. 5.12(b). As shown in Fig. 5.13, the PE's are divided into two groups (PE-a and PE-b). The convolution at the locations of the PE-b's takes three steps. First, each PE-a receives and accumulates the pulses from the four neighboring PD units. Next, each PE-a generates a pulse signal according to its own convolution result, and each PE-b accumulates the pulses from the PE-a's to its right and bottom as plus numbers. Finally, each PE-a generates the pulse again, and each PE-b accumulates the pulses from the PE-a's to its left and top as minus numbers. For the convolution at the locations of the PE-a's, these steps are repeated with the roles of PE-a and PE-b exchanged. The temporal intensity difference detection is demonstrated in Fig. 5.12(c), where the pixel intensity differences between two consecutive frames are shown along with the original sequence of a moving image.

[Figure 5.12 panel labels: 130 ms/frame (120 ms photo-current integration, 10 ms operation with readout); (c) temporal intensity difference detection]

Figure 5.12: Measurement results.
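The three-step procedure can be sketched as plain integer arithmetic. This is an illustrative model under stated assumptions, not the measured kernel itself: the functions `block_sums` and `edge_response` are hypothetical names, the PE at index (i, j) is assumed to sit between PD units (i, j), (i, j+1), (i+1, j) and (i+1, j+1), all mask bits are assumed to be 1, and pulse generation is idealized so that a PE passes on its accumulated value exactly.

```python
def block_sums(pd):
    """Step 1: every PE accumulates its four neighboring PD units."""
    rows, cols = len(pd) - 1, len(pd[0]) - 1
    return [
        [pd[i][j] + pd[i][j + 1] + pd[i + 1][j] + pd[i + 1][j + 1]
         for j in range(cols)]
        for i in range(rows)
    ]


def edge_response(pd, i, j):
    """Steps 2 and 3 for an interior PE at (i, j).

    The right and bottom neighbor PEs are accumulated as plus numbers,
    the left and top neighbor PEs as minus numbers.
    """
    s = block_sums(pd)
    plus = s[i][j + 1] + s[i + 1][j]    # step 2: count up
    minus = s[i][j - 1] + s[i - 1][j]   # step 3: count down
    return plus - minus


if __name__ == "__main__":
    flat = [[7] * 5 for _ in range(5)]
    ramp = [[r + c for c in range(5)] for r in range(5)]  # diagonal gradient
    print(edge_response(flat, 1, 1), edge_response(ramp, 1, 1))
```

A uniform image gives a zero response, while an intensity gradient running along the diagonal gives a nonzero one, which is the behavior expected of a diagonal edge-enhancing kernel.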

### 5.4.2 Design of 64×64 Focal-Plane Image Processor

We have designed a second version of the prototype chip with  $64 \times 64$  processing elements (PE's) and  $65 \times 65$  photodiodes using a  $0.18 \mu m$  1P5M standard CMOS technology. In this design, the time-domain arithmetic operation is fully implemented, thus realizing compact processing elements. Fig. 5.14 shows the layout pattern of the chip, and the specifications are summarized in Table 5.2.



Figure 5.13: Operation details of the convolution with the kernel shown in 5.12(b).

Since the chip is under fabrication at the time of this writing, simulation results obtained with the Nanosim simulator are shown in Fig. 5.15. The subtraction of the results between two processing elements (PE-a and PE-b) is carried out after each PE gathers the data from a neighboring PD unit. The operation principle explained in Section 5.2.2 is verified at 100MHz under a power-supply voltage of 1.8V.

Figure 5.16 compares the area efficiency of the processing element between the first version chip and the second version chip. The areas occupied by the major components in the PE are shown in the figure. Although the maximum bit length (the operation resolution) is extended from 8b to 11b in the second version chip, the total area is reduced because it employs a normal binary counter, while the first version chip utilized the up/down counter. Namely, introducing time-domain arithmetic significantly improves not only the speed but also the area efficiency of the PE.



Figure 5.14: Layout of experimental design of the second version chip.

Table 5.2: Specifications of the second version chip

| Technology       | 0.18μm CMOS 1-Poly-Si 5-Metal            |
|------------------|------------------------------------------|
| Core Size        | $3.8\text{mm} \times 4.1\text{mm}$       |
| Effective Pixels | 64×64 Processing Elements                |
|                  | 65×65 Photodiodes                        |
| Pixel Pitch      | $54.3\mu\text{m} \times 54.3\mu\text{m}$ |
| # of Transistors | 2,600,000                                |
|                  | (547 in a PE, 72 in a PD unit)           |
| Fill Factor      | 13.4 %                                   |
| Supply Voltage   | 1.8V                                     |
| Max. Clock Freq. | 100MHz (simulation)                      |



Figure 5.15: Simulation results from the second version chip. The subtraction of the results between two processing elements (PE-a and PE-b) is carried out after each PE gathers the data from a neighboring PD unit.



Figure 5.16: Area efficiency in processing element.

### 5.5 Summary

A mixed-signal focal-plane image processor for real-time spatiotemporal convolution has been developed based on the time-domain computation technique. Pixel information is represented as pulse widths, and all computations are carried out by simple digital logic and a binary counter in each processing element. In addition, the time-domain arithmetic operation has been developed to both increase the operation speed and reduce the circuit size. As a result, the compactness of analog circuits and the programmability of digital circuits have been achieved together. Proof-of-concept chips were designed in a 0.18- $\mu$ m 5-metal CMOS technology, and the concept has been verified by measurement of a fabricated chip as well as by simulation. The first version chip has demonstrated over 78,000 convolutions/s with a 1.0V supply in measurement. The second version chip is under fabrication, and its operation principles were demonstrated at 100MHz with a 1.8V supply by simulation.

# **Chapter 6**

# **Conclusions**

In this work, parallel-processing VLSI architectures employing a *logic-in-memory* architecture and a *smart-image-sensor* have been presented. In order to realize efficient execution of image processing on two-dimensional image data, a quaternary-tiled pixel-mapping scheme has been developed for the image filtering processor. The concept has been extended to the DPS image sensor, thus achieving the parallel readout of block-of-pixel data. Furthermore, a smart image sensor architecture employing the time-domain technique has been presented, which allows us to build a compact and programmable pixel processing element. The chapters of this thesis are summarized as follows.

In Chapter 2, a low-power and high-speed image filtering processor conducting a single-clock-cycle convolution with various kernels has been developed. In order to eliminate complicated memory address control and a large number of redundant memory accesses, a quaternary-tile pixel-mapping method has been developed. In addition, a radix-2 signed digit data transfer scheme is employed in the global data distribution in order to reduce not only the wiring area but also the power dissipation. A prototype chip was designed and fabricated in a 0.18- $\mu$ m CMOS technology. Without pipelining, the fabricated processor operating at 50MHz with a 1.8V supply conducts image filtering for a  $256 \times 256$  image within 1.31ms, which is better performance than that of a 2.2GHz MPU system.

In Chapter 3, a computational digital-pixel-sensor architecture which allows the parallel readout of block-of-pixel data for image processing by SIMD processing units has been presented. This architecture is capable of seamlessly scanning a  $5\times5$  pixel-kernel filter across the entire pixel array. As a result, the proposed DPS architecture is compatible with various image-processing algorithms. In this work, a rank-order-filtering circuit has also been developed as one application to demonstrate on-chip image processing. It unifies the rank order filtering algorithm and a binary search tree into one simplified circuit. The concept has been verified by a proof-of-concept chip fabricated in a 0.35- $\mu$ m CMOS technology. The chip includes a  $64\times48$  DPS array and eight units of the rank-order-filtering circuit element.

In Chapter 4, we have presented an architecture to realize high-speed gradient detection for analog motion sensor chips. A time-domain gradient detector computes and generates pulse-width signals representing both temporal and spatial gradients, and these signals are converted to signed digit values by on-chip processing. Furthermore, the seamless capturing operation is introduced by a shared photodiode configuration. Experimental results obtained from the fabricated test chip using  $0.35 \ \mu m$  CMOS technology have been demonstrated and the frame rate of 400 frames/sec was successfully achieved.

In Chapter 5, a mixed-signal focal-plane image processor for real-time spatiotemporal convolution has been developed based on the time-domain computation technique. This allows us to build a compact and programmable pixel processing element. Proof-of-concept chips have been designed in a 0.18- $\mu$ m CMOS technology, and the concept has been verified by measurement as well as by simulation. The first version chip demonstrates over 78,000 convolutions/s with a 1.0V supply in measurement. The second version chip is under fabrication, and its operation principles were verified by Nanosim simulation at 100MHz with a 1.8V supply.

Although the target application is restricted to image filtering operations in this work, the architectural concepts themselves are more general and applicable to other image processing VLSI's.

Since a block-readout access scheme for two-dimensional pixel data is essential in a number of image processing algorithms, the single-clock-cycle access method presented in Chapter 2 and Chapter 3 can significantly improve the *latency* of the system. Also, the smart image sensor architectures proposed in Chapter 4 and Chapter 5 enable us to realize more compact and flexible configurations in designing processing elements. Therefore, merging these two architectures would contribute to enhancing the performance or efficiency of various intelligent image processing systems. As a result, the VLSI architectures proposed in this work could become a driving force for realizing advanced real-time image-processing applications in the future.

# Appendix A

## **Time-Domain Winner-Take-All Circuit**

### A.1 Introduction

In the era of information technology, ever-increasing computational powers are demanded for digital computers. However, further increase in the clock frequencies in microprocessors is encountering severe limitations due to the difficulties in device miniaturization, interconnect complexity and large power dissipation in VLSI chips. Therefore, the replacement of a certain part of time-consuming software processing by direct computation using dedicated VLSI circuits is essential in reducing the number of microinstructions to be carried out in general purpose processors.

A winner-take-all (WTA) circuit is utilized to search for the maximum (or minimum) value in a large data set and to identify its location. The WTA function is an integral part of the vector quantization (VQ) algorithm, which is widely used in intelligent data processing such as in associative memories, Kohonen's self-organizing maps, and audio/visual data compression [57–59]. In the intelligent VLSI system based on the psychological brain model proposed in [59], VQ processing plays a principal role. In the VQ processing system, an input vector is matched with a large number of template vectors stored in the system, and their similarities (or dissimilarities, i.e., the distances between the input vector and template vectors) are calculated and the best-match template vector is identified as the winner. Since this processing is time-consuming when the number of template vectors becomes very large, several VLSI VQ processors have been developed both in digital [60–62] and analog [63,64] technologies with the aim of using them in real-time applications. In these VLSI accelerator chips, distances are calculated by SIMD (single-instruction multiple-data) elements in a fully parallel architecture and the winner is identified by a WTA circuitry. In a fully parallel SIMD architecture, the distance data are distributed over the entire chip area. Therefore, fast and accurate searching must be conducted despite the signal delay and latency present in the chip. The problem of signal delay is particularly important when a time domain technique is employed. The purpose of this study is to develop a fast and accurate WTA circuit operating in the time-domain employing a ramp-voltage scan technique.

The first WTA circuit was developed using a current-mode technique in the MOS subthreshold regime [65], and several other current-mode circuits have also been reported [66, 67]. In most of these circuits, a number of cells, each composed of a current source, competitively share the total current specified by a global current sink. Each cell inhibits every other unit, thus the unit with the highest initial activation strongly inhibits all the other units and survives as the winner. Since the circuit operates in the subthreshold regime, very low power dissipation is achieved at the cost of slow circuit operation.

A WTA in the voltage mode operation was first developed using the neuron MOS (neuMOS or vMOS) technology [68,69]. The circuit is composed of an array of comparators each equipped with a status latch function as shown in Fig. A.1(a). The multiple analog input voltages representing the distance values are compared with a common reference voltage which is ramped up from 0 to  $V_{DD}$  as a function of time. The comparator receiving the smallest input voltage is turned on first, followed by the one with the second smallest voltage, and then the one with the third smallest voltage, and so on. In this manner, the magnitudes of the input voltages are converted to turn-on timing signals, i.e., to time-domain data. The first turn-on signal is detected by a global OR circuit which receives all signals from the comparators. The output signal of the OR is fed back to all comparators to latch the output status at the moment of the first turn on. Then the latched status signal "1" identifies the location of the winner (the one with the minimum distance). Due to the signal propagation delay present in the feedback loop via the OR, however, multiple comparators may have the chance to turn on before the feedback signal arrives. This causes the problem of winner discrimination accuracy. The accuracy can be enhanced by employing a very slow ramp rate, but at the cost of a long search time. Namely, the accuracy and speed for winner search are in a trade-off relationship due to the presence of the feedback signal.

Figure A.1: Conventional WTA circuits: (a) voltage mode WTA circuit employing an array of vMOS comparators and a multiple-input OR for feedback; (b) open-loop delay time detector utilizing SR flip-flop tree architecture.
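The trade-off between accuracy and speed in the feedback-type circuit can be illustrated with a small timing model. This is a sketch under simplifying assumptions: ideal comparators, a single lumped feedback delay, and a hypothetical function name `latched_comparators`. The 3.3 ns delay used in the example is an illustrative value only.

```python
def latched_comparators(voltages_mv, ramp_mv_per_ns, feedback_delay_ns):
    """Return the indices of comparators that are on when the latch arrives.

    Each comparator turns on when the ramp reaches its input voltage.
    The latch signal arrives one feedback delay after the first turn-on,
    so every comparator that turned on in the meantime is also latched.
    """
    turn_on_ns = [v / ramp_mv_per_ns for v in voltages_mv]
    latch_ns = min(turn_on_ns) + feedback_delay_ns
    return [i for i, t in enumerate(turn_on_ns) if t < latch_ns]


if __name__ == "__main__":
    # Inputs 100 mV apart: the runner-up turns on 2 ns after the winner.
    print(latched_comparators([2500, 2600, 2700], 50, 3.3))
    # Inputs 200 mV apart: only the true winner is latched.
    print(latched_comparators([2500, 2700, 2800], 50, 3.3))
    # Same 100 mV spacing, but a ramp that is four times slower.
    print(latched_comparators([2500, 2600, 2700], 12.5, 3.3))
```

In this model a unique winner requires the voltage gap to exceed the ramp rate multiplied by the feedback delay, so lowering the ramp rate improves discrimination while lengthening the search.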

An open-loop architecture for a WTA circuit working in the time domain was first proposed in [70], where the Hamming distance between an input binary code and a binary template code was utilized as the dissimilarity measure (See Fig. A.1(b)). The exclusive OR of each bit comparison digitally controls the delay in the respective stage in an inverter chain, thus converting the Hamming distance to the delay of a propagating pulse in the inverter chain. Then the first arrival pulse is detected as the winner signal. Winner detection is carried out using a binary tree of RS flip-flops as shown in the figure. At each flip-flop, one of the two outputs is selectively activated by the first arrival pulse, and the winner pulse propagates through the flip-flop tree, thus memorizing the winner location as a binary flag pattern in the flip-flop tree. Since this is an open-loop configuration, there is no problem due to the delay time in a feedback loop.

The purpose of this study is to develop a high-performance voltage-mode WTA circuit operating in the time domain. The ramp-voltage technique was employed for voltage-to-delay-time conversion because the input signals are not digital Hamming distances but analog Manhattan or Euclidean distances. For delay detection, the open-loop binary tree architecture was employed. The system organization and the circuit details of the present WTA are described in Section A.2. In Section A.3, the experimental results of a fabricated chip are presented and discussed. Finally, the conclusions are given in Section A.4.

### A.2 System Organization and Circuit Details

#### A.2.1 System Organization



Figure A.2: Time-domain WTA circuit employing open-loop binary tree architecture developed in this study.

The block diagram of the WTA circuit developed in this work is shown in Fig. A.2. The circuit consists of a voltage-to-delay-time converter using the ramp voltage, a winner-detection binary tree and a winner address encoder. The inputs to the system are analog voltages and the output is a digital code representing the address of the winner location.

#### A.2.2 Winner-detection binary tree

As shown in Fig. A.2, the winner-detection binary tree is composed of two-input OR circuits cascaded in a binary-tree structure. A flip-flop is attached in parallel to each of the OR circuits. The flip-flop is employed only for comparing the timing difference in arrival between the two input signals and for storing the result. As already pointed out in [71], the replacement of the flip-flop tree in Fig. A.1(b) by a simple OR tree can resolve the problem of the metastable state occurring in a flip-flop when the two input signals arrive almost simultaneously. Since the timing signal propagates directly to the next stage through the OR, no problem occurs in the propagation of the winner signal even if the flip-flop enters a metastable state. The schematic of the two-input delay-detection unit employed in the winner-detection OR tree is illustrated in Fig. A.3. It is composed of a flip-flop circuit and an OR circuit. Dynamic logic gates are employed in order to realize a quick response to timing signals. In addition, an NMOS transistor T3 is inserted at the bottom of the OR circuit. The total sink current in the OR depends on whether one or both of the NMOS transistors T1 and T2 are activated. Variation in the sink current causes dispersion of the delay time in the OR. Therefore, the transistor T3 is used to limit the sink current and thus reduce the variation in the logic setup time. The variation in the logic setup time was evaluated by gate-level simulation using the  $0.6\mu m$  CMOS technology. When the gate size (W/L) of T3 was  $2.4\mu m/0.6\mu m$  and those of T1 and T2 were both  $4.8\mu m/0.6\mu m$ , the spread in the distribution of the logic setup time in the dynamic OR ( $\sim$  17 ps) was about half of that in the normal CMOS logic OR ( $\sim$  33 ps), where the size of the NMOS transistors was  $4.8\mu\text{m}/0.6\mu\text{m}$ . On the other hand, the most important concern for the flip-flop is not the circuit speed, but the sensitivity for detecting the difference between two input voltages. Therefore, all transistors in the flip-flop were designed using the minimum feature size to achieve a higher sensitivity by reducing the gate capacitance. In addition, special caution was taken in the flip-flop pattern layout to make it as symmetric as possible for the two input signals.

Figure A.3: Schematic of two-input delay-detection unit composed of a flip-flop and an OR circuit.

#### A.2.3 Winner address encoder

When the winner signal has propagated through the winner-detection OR tree, the winner address encoder starts to generate the binary address representing the location of the winner. The block diagram of the circuit is illustrated in Fig. A.4. Two-input encoder units are connected in the shape of a binary tree, which is a mirror image of the winner-detection OR tree. The two output flags of each flip-flop in the winner-detection OR tree are transferred to the inputs Flag0 and Flag1 of the corresponding two-input encoder unit. At first, the signal "Encode Start" is given to the two-input encoder unit in the first stage. Then one of the two-input encoders in the next stage is selectively activated depending on the input flag status in the first stage. In this manner, the path of the winner signal propagation in the winner-detection OR tree is traced back in the encoder circuit. The nodes "Address" and " $\overline{\text{Address}}$ " are precharged to  $V_{DD}$  before encoding. According to the input flags, the NMOS transistor T0 or T1 pulls down one of the precharged nodes "Address" and " $\overline{\text{Address}}$ ", and the location of the winner is encoded into a binary address code.
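The combined behavior of the OR tree and the address encoder can be sketched as follows. This is a behavioral model only, with hypothetical function names `or_tree` and `encode_address`: gate delays are ignored, each flip-flop is reduced to a single flag (0 if the first input arrived first, 1 otherwise), and the number of inputs is assumed to be a power of two.

```python
def or_tree(arrival_times):
    """Propagate timing signals through a binary tree of two-input ORs.

    Returns the arrival time at the root and the flags stored at each
    stage (stage 0 is the input side, the last stage is the root).
    """
    level = list(arrival_times)
    flags = []
    while len(level) > 1:
        stage_flags, next_level = [], []
        for a, b in zip(level[0::2], level[1::2]):
            stage_flags.append(0 if a <= b else 1)  # flip-flop result
            next_level.append(min(a, b))            # OR passes the earlier edge
        flags.append(stage_flags)
        level = next_level
    return level[0], flags


def encode_address(flags):
    """Trace the winner path back from the root to obtain its address."""
    node, address = 0, 0
    for stage_flags in reversed(flags):   # start at the root stage
        bit = stage_flags[node]
        address = (address << 1) | bit    # one address bit per stage
        node = 2 * node + bit             # activate the selected encoder
    return address


if __name__ == "__main__":
    times = [5, 3, 9, 1, 7, 8, 2, 6]
    root_time, flags = or_tree(times)
    print(root_time, encode_address(flags))
```

Only the encoder units along the winner path are visited, one per stage, so the address of a 64-input circuit is obtained from six flag look-ups in this model.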

#### A.2.4 Voltage-to-delay-time converter

The configuration of the voltage-to-delay-time converter is depicted in Fig. A.5, which employs an array of chopper comparators and a ramp-signal generator. The chopper comparator is composed of four-stage inverters, where the first and the second inverters are equipped with auto-zeroing reset switches. During the sample period, the input analog voltage VIN is stored in the capacitor C1, while the inverters are placed in the auto-zeroing state by closing the switches Reset1 and Reset2. After sampling the analog input voltages, the magnitude of the voltage is converted to the difference in the timing signal utilizing the ramp-up voltage. The operational principle is the same as that described in ref. 13, but the comparator configuration is different. The ascending ramp voltage Vramp is generated internally by the ramp signal generator utilizing a current source, which gradually charges up the capacitors C1 in all the comparators. Therefore, by controlling the drivability of the current source, the ramp-up rate can be changed in the winner search. For example, a ramp-up rate of 100 mV/ns converts a voltage difference of 100 mV to a timing gap of 1 ns. A lower ramp-up rate increases the voltage discrimination accuracy provided that the time resolution of the winner-detection tree is the same. In this design, the ramp-up rate can be changed in the range of  $0 \sim 200 \text{ mV/ns}$  with an external control signal.

Figure A.4: Configuration of winner address encoder.

Figure A.5: Circuit diagram of voltage-to-delay-time converter.
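The conversion can be written compactly. The symbols below are introduced here for illustration only: $\Delta V$ is the difference between two input voltages, $r$ is the ramp-up rate, and $\Delta t$ is the resulting timing gap, assuming an ideal linear ramp.

$$
\Delta t = \frac{\Delta V}{r}, \qquad \text{e.g.}\quad \frac{100\ \text{mV}}{100\ \text{mV/ns}} = 1\ \text{ns}, \qquad \frac{100\ \text{mV}}{50\ \text{mV/ns}} = 2\ \text{ns}
$$

Conversely, if the winner-detection tree resolves a timing gap $\Delta t_{\min}$, the smallest distinguishable voltage difference is $\Delta V_{\min} = r\,\Delta t_{\min}$, which is why a lower ramp-up rate improves voltage discrimination.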

### A.3 Results and Discussion

#### A.3.1 Chip fabrication



Figure A.6: Micrograph of the test chip and the pattern layout of the new WTA circuit.

The test chip was designed and fabricated using a double-poly, triple-metal 0.6µm CMOS technology. The chip includes a 64-input WTA circuit. A micrograph of the test chip is given in Fig. A.6. The layout area of the circuit is 3.2mm×1.0mm. It operates from a 5V power supply. An example of the test chip measurements is shown in Fig. A.7, where the output signals from the flip-flops in the winner-detection OR tree are observed. (The ramp-up rate was set at 105mV/ns, and the voltage difference between the winner and loser was set at 1.00V.)

#### A.3.2 Two-input delay-detection unit

The performance of the two-input delay-detection unit in Fig. A.3 was evaluated by circuit simulation and the results are shown in Fig. A.8 in comparison with the RS flip-flop that was designed to fit the layout of our WTA circuit. The simulation was conducted on the post-layout extracted circuits. The figure shows the waveforms observed at the OR output (the signal Next) in the two-input delay-detection unit, and the output of the RS flip-flop. In the simulation, the timing difference ( $\Delta t$ ) between the two input signals was changed from 0 to 300ps with a 5ps step as shown in Fig. A.8(a), and the variations in the output setup characteristics are compared in Fig. A.8(b). The RS flip-flop immediately latches the signal when the timing difference  $\Delta t$  is larger than 200ps. However, the output delay time increases as  $\Delta t$  decreases. For  $\Delta t = 0$ , the latching of the flip-flop still occurs due to the slight difference in the parasitic capacitances arising from the asymmetry still remaining in the layout pattern. Therefore, this is the worst-case simulation. The delay time increases to as large as 400ps for the flip-flop to latch. In the two-input delay-detection unit, on the other hand, the output signal Next settles fastest when the two input signals arrive simultaneously since both transistors T1 and T2 are turned on simultaneously to discharge the precharged node. When only one input signal arrives, the output signal Next settles slower. However, the time dispersion of the delay time in the signal Next is only 17ps due to the current limiting effect of T3 (See Fig. A.3).

Figure A.7: Measured waveforms of the test chip.

Figure A.8: Results of post-layout extracted circuit simulation of the two-input delay-detection unit. (a) The timing difference between IN0 and IN1 was varied from 0 to 300 ps with a 5 ps step. (b) The output characteristics of the two-input delay-detection unit are compared with those of the RS flip-flop.

Table A.1: Time resolution comparison between open-loop-type and feedback-type WTA circuits.

|                        | Open-loop-type WTA (present work) | Feedback-type WTA (Fig. A.1(a)) |
|------------------------|-----------------------------------|---------------------------------|
| Post-layout simulation | 338 ps                            | 3,300 ps                        |
| Measured result        | 400 ps                            | -                               |

#### A.3.3 64-input WTA circuit

Table A.1 shows the time-resolution comparison between the open-loop-type WTA circuit (present work) and the conventional feedback-type WTA circuit (the one shown in Fig. A.1(a)). The number of input voltages is 64 for both circuits. In the post-layout simulation of the feedback-type WTA circuit, the ramp-up rate was set at 50 mV/ns. The input voltage  $V_{IN0}$  was set at 2.5V, and the other input voltages were set at a slightly larger value than  $V_{IN0}$ . This difference in the input voltages was increased until only one comparator latched the winner signal. In this manner, the minimum voltage difference for correct winner detection was determined to be 165mV, which is equivalent to a time resolution of 3.3ns. On the other hand, in the simulation of the open-loop-type WTA circuit, the input voltage  $V_{IN0}$  was set at 2.5V,  $V_{IN32} \sim V_{IN63}$  at a slightly larger value than  $V_{IN0}$ , and the rest at VDD. As mentioned in Section A.3.2, the signal propagation through the OR tree is fastest when input signals arrive simultaneously. Therefore, this condition is the worst case for winner detection in the OR tree. For the ramp rate of 50mV/ns, a 16.9mV voltage difference between  $V_{IN0}$  and  $V_{IN32} \sim V_{IN63}$  is sufficient for the open-loop-type WTA circuit to identify  $V_{IN0}$  as the winner. Therefore, the winner-detection OR tree achieves a time resolution as small as 338ps. This simulation result was compared with the experimental data. The measured result for the open-loop-type WTA circuit demonstrates a time resolution of 400ps, which was calculated from the voltage difference between 2.000V (at  $V_{IN32} \sim V_{IN63}$ ) and 1.988V (at  $V_{IN0}$ ) at a ramp-up rate of 30mV/ns.
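The time-resolution figures quoted above follow from dividing each minimum voltage difference by the corresponding ramp-up rate. The short sketch below reproduces that arithmetic; the function name `time_resolution_ps` is hypothetical and an ideal linear ramp is assumed.

```python
def time_resolution_ps(delta_v_mv, ramp_mv_per_ns):
    """Convert a minimum voltage difference to a time resolution in ps."""
    return delta_v_mv / ramp_mv_per_ns * 1000.0


if __name__ == "__main__":
    # Feedback-type WTA, post-layout simulation: 165 mV at 50 mV/ns.
    print(round(time_resolution_ps(165.0, 50.0)))
    # Open-loop-type WTA, post-layout simulation: 16.9 mV at 50 mV/ns.
    print(round(time_resolution_ps(16.9, 50.0)))
    # Open-loop-type WTA, measurement: 2.000 V - 1.988 V = 12 mV at 30 mV/ns.
    print(round(time_resolution_ps(12.0, 30.0)))
```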

The timing chart of the total circuit operation is illustrated in Fig. A.9, where typical time values are shown as determined by circuit simulation. A system reset time of  $\sim$ 100ns is required for auto-zeroing of the chopper comparators composed of two-staged CMOS inverters. This 100ns is also sufficient for resetting the OR tree, the address encoder, and the ramp signal generator. After the winner detection by a ramp signal, the output signal from the winner-detection binary tree appears with a typical delay time of 5ns. The output signal is fed to the winner address encoder at node "Encode Start" (See Fig. A.4), thus activating the encoding operation. This takes about 10ns. When the address encoding is finished, the reset operation for the subsequent search can be initiated. In this manner, the total period required for the winner detection is about  $200 \sim 300$ ns depending on the time for the winner search period. The power dissipation of the circuit mostly occurs in the chopper comparators due to the short-circuit currents flowing in the CMOS inverters during the auto-zeroing. This can be reduced by selecting smaller aspect ratios (W/L) for the inverter transistors at the cost of slower operation. Design optimization in this regard was not conducted in the present work because the main interest of this work is the open-loop architecture after the voltage-to-time conversion. A possible solution to this issue is to employ the charge-transfer preamplifier concept, as presented in [72].

Figure A.9: Timing chart of the total circuit operation.

### A.4 Conclusions

A high-performance WTA circuit was developed based on the time-domain winner search regime. By employing the open-loop OR-tree architecture, the problem of multiple winner detection due to the feedback signal delay was eliminated. Experimental results obtained from the fabricated test chip showed that the circuit achieves a time resolution of 400ps. The post-layout simulation confirms that the resolution is 9.5 times higher than that of the conventional WTA circuit utilizing a feedback scheme via multiple-input OR.

# References

- [1] G. E. Moore, "Progress in digital integrated electronics," in *Proceedings of IEEE International Electron Devices Meeting (IEDM)*, 1975, pp. 11–13.
- [2] H. S. Stone, "A logic-in-memory computer," *IEEE Transactions on Computers*, vol. C-19, no. 1, pp. 73–78, Jan. 1970.
- [3] M. Mahowald, "Silicon retina with adaptive photodetectors," *Proceedings of SPIE, Visual Information Processing: From Neurons to Chips*, vol. 1473, pp. 52–58, 1991.
- [4] J. L. Hennessy and D. A. Patterson, Computer Organization and Design. San Francisco, CA: Morgan Kaufmann Publishers, 1997.
- [5] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick, "Intelligent RAM (IRAM): Chips that remember and compute," in *Digest of Technical Papers of 1997 IEEE International Solid-State Circuits Conference (ISSCC 1997)*, San Francisco, CA, Feb. 1997.
- [6] N. Yamashita, T. Kimura, Y. Fujita, Y. Aimoto, T. Manabe, S. Okazaki, K. Nakamura, and M. Yamashita, "A 3.84 GIPS integrated memory array processor with 64 processing elements and a 2-Mb SRAM," *IEEE Journal of Solid-State Circuits*, vol. 29, no. 11, pp. 1336–1343, Nov. 1994.

- [7] P. P. Jonker, "Why linear arrays are better image processors," in *Proc. 12th IAPR International Conf. on Pattern Recognition*, vol. III, Jerusalem, Israel, Oct. 9–13 1994, pp. 334–338.
- [8] S. Kyo, T. Koga, S. Okazaki, R. Uchida, S. Yoshimoto, and I. Kuroda, "A 51.2GOPS scalable video recognition processor for intelligent cruise control based on a linear array of 128 4-way VLIW processing elements," in *Dig. Tech. Papers 2003 IEEE Int. Solid-State Circuit Conf.* (ISSCC2003), San Francisco, CA, Feb. 2003.
- [9] D. W. Blevins, E. W. Davis, R. A. Heaton, and J. H. Reif, "BLITZEN: A highly integrated massively parallel machine," in *Proc. 2nd Symp. Frontiers of Massively Parallel Computation*, Oct. 12 1988.
- [10] D. G. Elliott, W. M. Snelgrove, and M. Stumm, "Computational RAM: A memory-SIMD hybrid and its application to DSP," in *Proc. IEEE 1992 Custom Integrated Circuit Conf.*, Boston, U.S.A., May 1992, pp. 30.6.1–30.6.4.
- [11] J. C. Gealow and C. G. Sodini, "A pixel-parallel image processor using logic pitch-matched to dynamic memory," *IEEE Journal of Solid-State Circuits*, vol. 34, no. 6, pp. 831–839, June 1999.
- [12] V. Cantoni and L. Lombardi, "Hierarchical architectures for computer vision," in *Proceedings of 3rd Euromicro Workshop on Parallel and Distributed Processing*, Jan. 25–27 1995, pp. 392–398.
- [13] W. D. Hillis, *The Connection Machine*. MIT Press, 1986.
- [14] C. A. Mead and M. A. Mahowald, "A silicon model of early visual processing," *Neural Networks*, vol. 1, pp. 91–97, 1988.
- [15] C. Mead, *Analog VLSI and Neural Systems*. Boston: Addison-Wesley, 1989.
- [16] C. Koch and H. Li, *Vision Chips: Implementing Vision Algorithms with Analog VLSI Circuits*. Los Alamitos, CA: IEEE Computer Society Press, 1995.

- [17] R. Etienne-Cummings, J. V. der Spiegel, and P. Mueller, "A focal plane visual motion measurement sensor," *IEEE Transactions on Circuits and Systems—Part 1: Fundamental Theory and Applications*, vol. 44, no. 1, pp. 55–66, Jan. 1997.
- [18] R. Etienne-Cummings, Z. K. Kalayjian, and D. Cai, "A programmable focal-plane MIMD image processor chip," *IEEE Journal of Solid-State Circuits*, vol. 36, no. 1, pp. 64–73, Jan. 2001.
- [19] V. Brajovic and T. Kanade, "Computational sensor for visual tracking with attention," *IEEE Journal of Solid-State Circuits*, vol. 33, no. 8, pp. 1199–1207, Aug. 1998.
- [20] —, "A VLSI sorting image sensor: Global massively parallel intensity-to-time processing for low-latency adaptive vision," *IEEE Transactions on Robotics and Automation*, vol. 15, no. 1, pp. 67–75, Feb. 1999.
- [21] S. Mehta and R. Etienne-Cummings, "Normal optical flow chip," in *Proc. 2003 IEEE International Symposium of Circuits and Systems (ISCAS'03)*, vol. IV, Bangkok, Thailand, May 25–28 2003, pp. 784–787.
- [22] —, "Normal optical flow measurement on a CMOS APS imager," in *Proc. 2004 IEEE International Symposium of Circuits and Systems (ISCAS'04)*, vol. IV, Vancouver, Canada, May 23–26 2004, pp. 848–851.
- [23] R. Dominguez-Castro, S. Espejo, A. Rodriguez-Vazquez, R. A. Carmona, P. Foldesy, A. Zarandy, P. Szolgay, T. Sziranyi, and T. Roska, "A 0.8-μm CMOS two-dimensional programmable mixed-signal focal-plane array processor with on-chip binary imaging and instructions storage," *IEEE Journal of Solid-State Circuits*, vol. 32, no. 7, pp. 1013–1026, July 1997.
- [24] G. Cembrano, A. Rodriguez-Vazquez, R. Galan, F. Jimenez-Garrido, S. Espejo, and R. Dominguez-Castro, "A 1000 FPS at 128×128 vision processor with 8-bit digitized I/O," *IEEE Journal of Solid-State Circuits*, vol. 39, no. 7, pp. 1044–1055, July 2004.

- [25] P. Dudek and P. J. Hicks, "A CMOS general-purpose sampled-data analog processing element," *IEEE Transactions on Circuits and Systems—Part II: Analog and Digital Signal Processing*, vol. 47, no. 5, pp. 467–473, May 2000.
- [26] ——, "A general-purpose processor-per-pixel analog SIMD vision chip," *IEEE Transactions on Circuits and Systems—Part I: Fundamental Theory and Applications*, vol. 52, no. 1, pp. 13–20, Jan. 2005.
- [27] M. Ishikawa, K. Ogawa, T. Komuro, and I. Ishii, "A CMOS vision chip with SIMD processing element array for 1ms image processing," in *Digest of Technical Papers of 1999 IEEE International Solid-State Circuits Conference (ISSCC'99)*, San Francisco, CA, Feb. 15–17 1999, pp. 206–207.
- [28] M. Ishikawa and T. Komuro, "Digital vision chips and high-speed vision systems," in *Digest of Technical Papers of Symposium on VLSI Circuits*, Kyoto, Japan, June 14–16 2001, pp. 1–4.
- [29] T. Komuro, I. Ishii, M. Ishikawa, and A. Yoshida, "A digital vision chip specialised for high-speed target tracking," *IEEE Transactions on Electron Devices*, vol. 50, no. 1, pp. 191–199, Jan. 2003.
- [30] M. Ishikawa, K. Ogawa, T. Komuro, and I. Ishii, "A dynamically reconfigurable SIMD processor for a vision chip," *IEEE Journal of Solid-State Circuits*, vol. 39, no. 1, pp. 265–268, Jan. 2004.
- [31] R. Cypher and J. L. C. Sanz, "SIMD architectures and algorithms for image processing and computer vision," *IEEE Transactions on Acoustics, Speech, and Signal Processing*, vol. 37, no. 12, pp. 2158–2174, Dec. 1989.
- [32] A. Gentile and D. S. Wills, "Impact of pixel per processor ratio on embedded SIMD architecture," in *11th International Conference on Image Analysis and Processing (ICIAP'01)*, Palermo, 26–28 Sept. 2001, pp. 204–208.

- [33] Y. Nakashita, Y. Mita, and T. Shibata, "Analog edge-filtering processor employing only-nearest-neighbor interconnects," *Japanese J. of Appl. Phys.*, vol. 44, no. 4B, pp. 2119–2124, Apr. 2005.
- [34] A. E. Gamal, D. Yang, and B. Fowler, "Pixel level processing: why, what, and how?" in *Proceedings of the SPIE, Vol. 3650*, San Jose, CA, Jan. 27–28 1999, pp. 2–13.
- [35] M. Yagi, M. Adachi, and T. Shibata, "A hardware-friendly soft-computing algorithm for image recognition," in *Proc. 10th European Signal Processing Conf. (EUSIPCO 2000)*, Tampere, Finland, Sept. 4–8 2000, pp. 729–732.
- [36] M. Yagi and T. Shibata, "An image representing algorithm compatible with neural-associative-processor-based hardware recognition systems," *IEEE Transactions on Neural Networks*, vol. 14, no. 5, pp. 1144–1161, Sept. 2003.
- [37] T. M. Le, W. M. Snelgrove, and S. Panchanathan, "SIMD processor arrays for image and video processing: A review," in *Multimedia Hardware Architectures*, *Proc. SPIE*, vol. 3311, 1998, pp. 30–41.
- [38] R. Cypher and J. L. C. Sanz, "SIMD architectures and algorithms for image processing and computer vision," *IEEE Transactions on Acoustics, Speech, and Signal Processing*, vol. 37, no. 12, pp. 2158–2174, Dec. 1989.
- [39] D. W. Hammerstrom and D. P. Lulich, "Image processing using one-dimensional processor array," *Proceedings of the IEEE*, vol. 84, no. 7, pp. 1005–1018, July 1996.
- [40] C.-W. Yoon, R. Woo, J. Kook, S.-J. Lee, K. Lee, and H.-J. Yoo, "An 80/20-MHz 160-mW multimedia processor integrated with embedded DRAM, MPEG-4 accelerator, and 3-D rendering engine for mobile applications," *IEEE Journal of Solid-State Circuits*, vol. 36, no. 11, pp. 1758–1767, Nov. 2001.

- [41] E. R. Fossum, "CMOS image sensors: Electronic camera-on-a-chip," *IEEE Transactions on Electron Devices*, vol. 44, no. 10, pp. 1689–1698, Oct. 1997.
- [42] B. Fowler, A. E. Gamal, and D. X. D. Yang, "A CMOS area image sensor with pixel-level A/D conversion," in *Digest of Technical Papers of 1994 IEEE International Solid-State Circuits Conference (ISSCC94)*, San Francisco, CA, Feb. 16–18 1994, pp. 226–227.
- [43] D. X. D. Yang, B. Fowler, and A. E. Gamal, "A Nyquist-rate pixel-level ADC for CMOS image sensors," *IEEE Journal of Solid-State Circuits*, vol. 34, no. 3, pp. 348–356, Mar. 1999.
- [44] F. Andoh, H. Shimamoto, and Y. Fujita, "A digital pixel image sensor for real-time readout," *IEEE Transactions on Electron Devices*, vol. 47, no. 11, pp. 2123–2127, Nov. 2000.
- [45] D. Yang, A. E. Gamal, B. Fowler, and H. Tian, "A 640\*512 CMOS image sensor with ultra wide dynamic range floating-point pixel-level ADC," *IEEE Journal of Solid-State Circuits*, vol. 34, no. 12, pp. 1821–1834, Dec. 1999.
- [46] L. G. McIlrath, "Low-power low-noise ultrawide-dynamic-range CMOS imager with pixel-parallel A/D conversion," *IEEE Journal of Solid-State Circuits*, vol. 36, no. 5, pp. 846–853, May 2001.
- [47] A. Kitchen, A. Bermak, and A. Bouzerdoum, "A digital pixel sensor array with programmable dynamic range," *IEEE Transactions on Electron Devices*, vol. 52, no. 12, pp. 2591–2601, Dec. 2005.
- [48] S. Kleinfelder, S. Lim, X. Liu, and A. E. Gamal, "A 10,000 frames/s CMOS digital pixel sensor," *IEEE Journal of Solid-State Circuits*, vol. 36, no. 12, pp. 2049–2059, Dec. 2001.
- [49] W. Bidermann, A. E. Gamal, S. Ewedemi, J. Reyneri, H. Tian, D. Wile, and D. Yang, "A 0.18μm high dynamic range NTSC/PAL imaging system-on-chip with embedded DRAM frame buffer," in *Digest of Technical Papers of 2003 IEEE International Solid-State Circuits Conference*, San Francisco, CA, Feb. 2003, pp. 212–213, 488.
- [50] B. K. Kar and D. K. Pradhan, "A new algorithm for order statistic and sorting," *IEEE Transactions on Signal Processing*, vol. 41, no. 8, pp. 2688–2694, Aug. 1993.
- [51] K. Ito, M. Ogawa, and T. Shibata, "A variable-kernel flash-convolution image filtering processor," in *Digest of Technical Papers of 2003 IEEE International Solid-State Circuits Conference (ISSCC 2003)*, San Francisco, CA, Feb. 9–13 2003, pp. 470–471.
- [52] C. M. Higgins, R. A. Deutschmann, and C. Koch, "Pulse-based analog 2-D motion sensors," *IEEE Transactions on Circuits and Systems—Part II: Analog and Digital Signal Processing*, vol. 46, no. 6, pp. 677–687, June 1999.
- [53] A. Iwata and M. Nagata, "A concept of analog-digital merged circuit architecture for future VLSI's," *IEICE Trans. Fundamentals*, vol. E79-A, no. 2, pp. 145–157, Feb. 1996.
- [54] B. K. Horn and B. G. Schunck, "Determining optical flow," *Artificial Intelligence*, vol. 17, pp. 185–203, 1981.
- [55] M. Nagata, J. Funakoshi, and A. Iwata, "A PWM signal processing core circuit based on a switched current integration technique," *IEEE Journal of Solid-State Circuits*, vol. 33, no. 1, pp. 53–60, Jan. 1998.
- [56] M. Nagata, M. Homma, N. Takeda, T. Morie, and A. Iwata, "A smart CMOS imager with pixel level PWM signal processing," in *Digest of Technical Papers of 1999 Symposium on VLSI Circuits*, Kyoto, Japan, June 17–19 1999, pp. 141–144.
- [57] A. Gersho and R. M. Gray, *Vector Quantization and Signal Compression*. Boston: Kluwer Academic Publishers, 1992.
- [58] T. Kohonen, *Self-Organizing Maps*. Berlin: Springer, 1995.

- [59] T. Shibata, "Intelligent VLSI systems based on a psychological brain model," in *Proceedings of 2000 IEEE International Symposium on Intelligent Signal Processing and Systems (ISPACS 2000)*, Honolulu, Hawaii, U.S.A., Nov. 5–8 2000, pp. 323–332.

- [60] K. Tsang and B. W. Y. Wei, "A VLSI architecture for a real-time code book generator and encoder of a vector quantizer," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 2, no. 3, pp. 360–364, Sept. 1994.
- [61] T. Shibata, A. Nakada, T. Morimoto, T. Ohmi, H. Akutsu, A. Kawamura, and K. Marumoto, "A fully-parallel vector quantization processor for real-time motion picture compression," in *Digest of Technical Papers, 1997 IEEE International Solid-State-Circuit Conference (ISSCC1997)*, no. FP16.9, San Francisco, CA, Feb. 1997.
- [62] A. Nakada, T. Shibata, M. Konda, T. Morimoto, and T. Ohmi, "A fully parallel vector-quantization processor for real-time motion-picture compression," *IEEE Journal of Solid-State Circuits*, vol. 34, no. 6, pp. 822–830, June 1999.
- [63] G. T. Tuttle, S. Fallahi, and A. A. Abidi, "An 8b CMOS vector A/D converter," in *Digest of Technical Papers, 40th IEEE International Solid-State Circuits Conference (ISSCC1993)*, San Francisco, CA, Feb. 24–26 1993, pp. 38–39.
- [64] A. Nakada, M. Konda, T. Morimoto, T. Yonezawa, T. Shibata, and T. Ohmi, "Fully-parallel VLSI implementation of vector quantization processor using neuron-MOS technology," *IEICE Transactions on Electronics*, vol. E82-C, no. 9, pp. 1730–1737, 1999.
- [65] J. Lazzaro, S. Ryckebusch, M. Mahowald, and C. Mead, "Winner-take-all networks of O(N) complexity," in *Advances in Neural Information Processing Systems 1*, D. S. Touretzky, Ed. Morgan Kaufmann, 1989, pp. 703–711.
- [66] J. Choi and B. J. Sheu, "A high-precision VLSI winner-take-all circuit for self-organizing neural networks," *IEEE Journal of Solid-State Circuits*, vol. 28, no. 5, pp. 576–584, May 1993.
- [67] G. Cauwenberghs and V. Pedroni, "A charge-based CMOS parallel analog vector quantizer," in *Neural Information Processing Systems 7*, D. Touretzky and T. Leen, Eds. Cambridge, MA: MIT Press, 1995.
- [68] T. Shibata and T. Ohmi, "A functional MOS transistor featuring gate-level weighted sum and threshold operations," *IEEE Transactions on Electron Devices*, vol. 39, no. 6, pp. 1444–1455, June 1992.
- [69] T. Yamashita, T. Shibata, and T. Ohmi, "Neuron MOS winner-take-all circuit and its application to associative memory," in *1993 IEEE International Solid-State Circuits Conference (ISSCC1993)*, San Francisco, CA, Feb. 24–26 1993, pp. 236–237, 294.
- [70] M. Ikeda and K. Asada, "Time-domain minimum-distance detector and its application to low power coding scheme on chip interface," in *Proceedings of the 24th European Solid-State Circuits Conference (ESSCIRC'98)*, Sept. 22–24 1998, pp. 464–467.
- [71] ——, "A design example of a memory macro block with a time-domain minimum-distance search mechanism," in *Proceedings of DA Symposium 2000 (in Japanese)*, July 2000, p. 139.
- [72] K. Kotani, T. Shibata, and T. Ohmi, "CMOS charge-transfer preamplifier for offset-fluctuation cancellation in low-power A/D converters," *IEEE Journal of Solid-State Circuits*, vol. 33, no. 5, pp. 762–769, May 1998.

# **List of Publications**

#### **Journal Paper**

- 1. <u>K. Ito</u>, M. Ogawa and T. Shibata, "A High-Performance Ramp-Voltage-Scan Winner-Take-All Circuit in an Open Loop Architecture," *Japanese Journal of Applied Physics*, Vol. 41, Part 1, No. 4B, pp. 2301-2305, April 2002.
- 2. <u>K. Ito</u>, M. Ogawa and T. Shibata, "A Variable-Kernel Flash-Convolution Image Filtering Processor," submitted to *IEEE Journal of Solid-State Circuits*.

#### **Refereed Papers at International Conference**

- 1. <u>K. Ito</u>, M. Ogawa and T. Shibata, "A High-Performance Time-Domain Winner-Take-All Circuit Employing OR-Tree Architecture," in Extended Abstracts, the 2001 International Conference on Solid State Devices and Materials (SSDM 2001), pp. 94-95, Tokyo, Sep. 26-28, 2001.
- 2. M. Ogawa, <u>K. Ito</u>, and T. Shibata, "A General-Purpose Vector-Quantization Processor Employing Two-Dimensional Bit-Propagating Winner-Take-All," in Digest of Technical Papers of 2002 Symposium on VLSI Circuits, pp. 244-247, Honolulu, June 13-15, 2002.
- 3. <u>K. Ito</u>, M. Ogawa, T. Shibata, "A Variable-Kernel Flash-Convolution Image Filtering Processor," in Digest of Technical Papers, 2003 IEEE International Solid-State-Circuit Conference (ISSCC2003), Paper No. 26.7, pp. 470-471, San Francisco, February, 2003.
- 4. B. Tongprasit, <u>K. Ito</u> and T. Shibata, "A Computational Digital-Pixel-Sensor VLSI Featuring Block-Readout Architecture for Pixel-Parallel Rank-Order Filtering," in *Proceedings of the 2005 International Symposium on Circuits and Systems (ISCAS'05)*, Vol. 3, Kobe, Japan, May 22-26, 2005, pp. 2389-2392.

- 5. <u>K. Ito</u> and T. Shibata, "A Time-Domain Gradient-Detection Architecture for VLSI Analog Motion Sensors," in Proceedings of the 2006 International Symposium on Circuits and Systems (ISCAS'06), Island of Kos, Greece, May 21-24, 2006, pp. 201-204.

- 6. <u>K. Ito</u> and T. Shibata, "A Mixed-Signal Focal-Plane Image Processor Employing Time-Domain Computation Technique," submitted to *The 2007 Symposium on VLSI Circuits*.
- 7. H. Shikano, <u>K. Ito</u> and T. Shibata, "K-means Learning Processor with Automatic Seeds Generation Based on Parallel Vector Quantization Architecture," submitted to *The 2007 Symposium on VLSI Circuits*.

#### **Papers at Domestic Conferences (in Japanese)**

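- 1. M. Ogawa, <u>K. Ito</u>, and T. Shibata, "A General-Purpose VQ Processor Using Two-Dimensional Bit-Propagation WTA," IEICE Technical Report (Technical Committee on Integrated Circuits (ICD)), Dec. 2002.
- 2. <u>K. Ito</u>, M. Ogawa, and T. Shibata, "A Flash-Convolution Image Filtering Processor," IEICE Technical Report (Technical Committee on Integrated Circuits (ICD)), May 2003.
- 3. S. Morikawa (森川 重毅), <u>K. Ito</u>, and T. Shibata, "A K-means VLSI Processor and Its Application to Self-Organized Region Segmentation of Images," IEICE Technical Meeting on Neurocomputing, Nov. 2006.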

#### **Awards**

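- 1. Inose Student Encouragement Award (猪瀬学生奨励賞), Foundation for the Promotion of Electrical and Electronic Information Science (電気・電子情報学術振興財団), 2002.
- 2. IP Award, 6th LSI IP Award (第6回 LSI IP アワード), April 2003.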

#### **Patents**

1. M. Ogawa, K. Ito, and T. Shibata, "Image Processing Apparatus and Image Processing Method," Japanese Patent Application No. 2003-031569.