# IMAP-VISION: An SIMD Processor with High-Speed On-chip Memory and Large Capacity External Memory

Yoshihiro Fujita Nobuyuki Yamashita Shin'ichiro Okazaki \* Information Technology Research Laboratory NEC Corporation

# Abstract

This paper describes a single-board real-time vision system with 256 SIMD processors and its performance. The newly developed LSI, IMAP-VISION, integrates 32 8-bit processors, 1 Kbyte/processor on-chip memories, and an external memory interface with 160 Mbyte/s bandwidth on a single chip. With eight IMAP-VISION chips and eight 16 Mbit synchronous DRAMs as external memories, the system has a peak performance of 10.24 GIPS as well as the memory capacity to hold four 256×256 8-bit images in the IMAP-VISION on-chip memories and 256 images in the external memories. Data transfer between on-chip and external memories can be carried out asynchronously and concurrently with computations, enabling efficient use of the external memories for memory-intensive algorithms. Several algorithms, such as labeling, Hough transform, optical flow, and stereo matching, are shown to run much faster than the video rate.

## 1 Introduction

An SIMD processor array has been shown to be suitable for image processing[1]–[3]. We have proposed an integrated memory array processor (IMAP), which enabled the design of a compact and high performance system, and have built real-time vision systems[4]–[7]. Although the IMAP chip has high levels of computational ability by integrating one-dimensional SIMD processors and large capacity memories on a single chip, its on-chip memory capacity is sometimes insufficient for flexible execution of complicated algorithms.

We have developed a new version of the IMAP chip, IMAP-VISION, which has a high-bandwidth external memory interface with an efficient data transfer mechanism. Using the external memory, each processor can make use of both high-speed on-chip memory and large-capacity external memory.

This paper describes the architecture of the IMAP-VISION chip and the single-board real-time vision system, and discusses performance figures for some image processing algorithms.



Figure 1: IMAP-VISION chip and external memory block diagram

## 2 IMAP-VISION Chip Architecture

Figure 1 shows the IMAP-VISION chip configuration, which integrates 32 8-bit processors and 32 8-Kbit SRAMs. The processors are connected in series to form a one-dimensional SIMD processor array. Instructions for the processors are supplied by an external control processor. The IMAP-VISION specifications are listed in Table 1, and a chip microphotograph is shown in Figure 2.

### 2.1 External Memory

In our previous IMAP chip, each processor has a 4 Kbyte on-chip memory[6]. In the new IMAP-VISION chip, each processor has only 1 Kbyte of on-chip memory but can access 64 Kbyte/processor of external memory in a 16 Mbit synchronous DRAM. Data are transferred in 32-byte blocks (1 byte/processor) at a bandwidth of 160 Mbyte/s. Data transfers can be issued asynchronously and concurrently with computations.
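The per-chip figures above can be cross-checked with a little arithmetic. The following is only a sketch: it assumes that 16 Mbit means 16 × 2^20 bits, borrows the 40 MHz processor clock from Table 1, and uses variable names of our own choosing.

```python
# Per-chip external memory arithmetic (illustrative names, not from the paper).
PROCESSORS_PER_CHIP = 32
SDRAM_BITS = 16 * 2**20          # 16 Mbit, assumed to mean 16 * 2^20 bits
EXT_BANDWIDTH = 160e6            # external memory bandwidth in bytes/s
PROCESSOR_CLOCK_HZ = 40e6        # processor clock from Table 1
BLOCK_BYTES = PROCESSORS_PER_CHIP * 1   # one byte per processor

# External memory available to each processor, in bytes.
ext_bytes_per_processor = SDRAM_BITS // 8 // PROCESSORS_PER_CHIP

# Time to move one 32-byte block at peak bandwidth.
block_transfer_us = BLOCK_BYTES / EXT_BANDWIDTH * 1e6
block_transfer_cycles = BLOCK_BYTES / EXT_BANDWIDTH * PROCESSOR_CLOCK_HZ

print(ext_bytes_per_processor // 1024, "Kbyte of external memory per processor")
print(f"{block_transfer_us:.2f} us per block, "
      f"about {block_transfer_cycles:.0f} processor cycles")
```

Under these assumptions the 64 Kbyte/processor figure follows directly from the SDRAM capacity, and one block transfer at peak bandwidth occupies the interface for roughly eight processor clock cycles.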

<sup>\*</sup>Address: 4-1-1 Miyazaki, Miyamae, Kawasaki 216, Japan. E-mail: fujita@pat.cl.nec.co.jp

| Item                             | Specification                         |
|----------------------------------|---------------------------------------|
| Process technology               | 0.55 µm CMOS, triple-layer metal      |
| Chip size                        | 14.8 × 14.8 mm                        |
| Number of transistors            | logic: 596,052<br>memory: 1,728,000   |
| Processor clock                  | 40 MHz                                |
| External memory clock            | 80 MHz                                |
| Peak performance                 | 1.28 GIPS (8-bit operations)          |
| On-chip memory capacity          | 256 kbit                              |
| External memory capacity         | 16 Mbit                               |
| On-chip memory access bandwidth  | 1.28 Gbyte/s                          |
| External memory access bandwidth | 160 Mbyte/s                           |
| Maximum power dissipation        | 1.6 W                                 |
| Power supply                     | 3.3 V                                 |
| Package                          | 208 pin QFP                           |

Table 1: IMAP-VISION chip specifications

Figure 2: IMAP-VISION chip microphotograph

Figure 3: Overlap effect of data transfer between on-chip and external memories. Panels: (a) images stored in on-chip memory ($3 \times 3$ averaging filter, 140.7 µs); (b) sequential execution on images stored in external memory; (c) overlap execution on images stored in external memory.

Figure 3 shows the overlap effect of data transfer between on-chip and external memories when a $3 \times 3$ averaging filter is applied to images stored in the on-chip memory and the external memory.

It takes 140.7 µs to apply a $3 \times 3$ averaging filter to images stored in the on-chip memory (Figure 3a). Figure 3b shows sequential execution on images stored in the external memory, and Figure 3c shows overlap execution. Overlap execution reduces the data transfer overhead from 73% to 9%.
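Why overlapping hides most of the transfer cost can be illustrated with a generic double-buffering timing model. This is a sketch under our own assumptions, not the scheduling used by IMAP-VISION: the function names and the per-block times are hypothetical, and the example does not attempt to reproduce the 73% and 9% figures.

```python
def sequential_time(n_blocks, transfer, compute):
    """Each block is first transferred, then processed, one after another."""
    return n_blocks * (transfer + compute)


def overlapped_time(n_blocks, transfer, compute):
    """While block i is processed, block i + 1 is transferred concurrently.

    Only the first transfer and the last computation cannot be overlapped;
    in between, each step costs the slower of the two activities.
    """
    if n_blocks == 0:
        return 0.0
    return transfer + (n_blocks - 1) * max(transfer, compute) + compute


def overhead(total, n_blocks, compute):
    """Extra time relative to pure computation, as a fraction."""
    pure = n_blocks * compute
    return (total - pure) / pure


# Hypothetical per-block times in arbitrary units (not measurements).
n, t, c = 4, 1.0, 3.0
print(overhead(sequential_time(n, t, c), n, c))   # about 0.33
print(overhead(overlapped_time(n, t, c), n, c))   # about 0.08
```

In this model the remaining overhead shrinks as the number of blocks grows, provided the computation per block takes at least as long as the transfer.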

### 2.2 Bit-Pack Function

The IMAP-VISION chip uses 8-bit processors for computation; in some applications, however, it is also important to improve binary image processing performance. We implemented a bit-pack function that packs the binary values of the eight-connected neighbor pixels into one byte of data within one clock cycle (Figure 4). Performance results for the $3 \times 3$ binary logical operation, thinning, and labeling, which make use of the bit-pack function, are discussed in Section 4.
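A software equivalent of the bit-pack function might look like the sketch below. The bit order and the treatment of pixels outside the image are our assumptions (the actual assignment is defined by the hardware and shown in Figure 4), and the function name `bit_pack` is illustrative.

```python
# Assumed bit order (hypothetical): bit 0 is the north-west neighbor, then
# row by row, skipping the center pixel, up to bit 7 for the south-east one.
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                    (0, -1),           (0, 1),
                    (1, -1),  (1, 0),  (1, 1)]


def bit_pack(image):
    """Pack the eight neighbors of every pixel of a binary image into a byte.

    `image` is a list of rows containing 0/1 values.
    """
    height, width = len(image), len(image[0])
    packed = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            byte = 0
            for bit, (dy, dx) in enumerate(NEIGHBOR_OFFSETS):
                ny, nx = y + dy, x + dx
                # Pixels outside the image are treated as 0 (an assumption).
                if 0 <= ny < height and 0 <= nx < width and image[ny][nx]:
                    byte |= 1 << bit
            packed[y][x] = byte
    return packed


image = [[1, 0, 0],
         [0, 0, 0],
         [0, 0, 1]]
print(bit_pack(image)[1][1])  # north-west and south-east bits set
```

Once the neighborhood is available as a single byte, a $3 \times 3$ binary operation reduces to a comparison or table lookup on that byte; for example, a pixel survives a $3 \times 3$ erosion when it is 1 and its packed value is 255.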



Figure 4: Bit-pack function

## 3 IMAP-VISION System

Figure 5 shows the IMAP-VISION system with the IMAP-VISION board installed in a host workstation. Figure 6 shows the IMAP-VISION board itself. The IMAP-VISION board integrates eight IMAP-VISION chips, eight 16-Mbit synchronous DRAMs, a control processor, a video interface, and a VME64 interface, using both sides of one double-height VME board.

Figure 5: IMAP-VISION system

Figure 7: IMAP-VISION system configuration

Figure 7 shows a block diagram of the IMAP-VISION system, focusing on the control processor. The control processor consists of a sequencer, a 16-bit processor, a 64 Kword program memory, a 64 Kword data memory, and a bus controller.

The sequencer controls the program sequence. It changes the sequence when subroutine calls, interrupts, and branch instructions are issued.

The program memory holds a long instruction word program code which consists of a control processor instruction and an IMAP instruction. The sequencer fetches and broadcasts instructions to the IMAP processor array every clock cycle.

The 16-bit processor is used not only for address calculations in IMAP load/store instructions, but also for executing sequential algorithms that cannot be mapped well onto the IMAP processor array.

The bus controller regulates memory access from the host workstation to all memories, as well as block data transfer between on-chip and external memories. This controller allows the host workstation to access the data without collision, as if it were in its own memory, even while a real-time vision task is running.

By connecting eight IMAP-VISION chips in series, the board forms a one-dimensional 256-processor array with a peak performance of 10.24 GIPS. The on-chip and external memories can hold 4 frames and 256 frames of $256 \times 256$ pixel 8-bit images, respectively. The bandwidth between the on-chip and external memories is 1.28 Gbyte/s, which enables the transfer of 644 frames in 33.3 ms, as shown on the right side of Figure 7.
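The board-level figures follow from the per-chip specifications in Table 1. The sketch below reproduces them under the assumption that 16 Mbit means 16 × 2^20 bits and that the eight external memory interfaces operate in parallel; the variable names are ours. The last line is only a raw-bandwidth upper bound on the number of frames per 33.3 ms.

```python
CHIPS = 8
PROCESSORS_PER_CHIP = 32
CHIP_PEAK_MIPS = 1280                  # 1.28 GIPS per chip (Table 1)
ONCHIP_BYTES_PER_PROCESSOR = 1024      # 1 Kbyte per processor
SDRAM_BYTES = 16 * 2**20 // 8          # one 16 Mbit SDRAM per chip (assumed 2^20)
CHIP_EXT_BANDWIDTH = 160e6             # bytes/s per chip
FRAME_BYTES = 256 * 256                # one 256 x 256 8-bit image
FRAME_PERIOD_S = 33.3e-3               # video frame period

processors = CHIPS * PROCESSORS_PER_CHIP
peak_gips = CHIPS * CHIP_PEAK_MIPS / 1000
onchip_frames = processors * ONCHIP_BYTES_PER_PROCESSOR // FRAME_BYTES
external_frames = CHIPS * SDRAM_BYTES // FRAME_BYTES
board_bandwidth = CHIPS * CHIP_EXT_BANDWIDTH
ideal_frames_per_period = board_bandwidth * FRAME_PERIOD_S / FRAME_BYTES

print(processors, "processors,", peak_gips, "GIPS peak")
print(onchip_frames, "frames on chip,", external_frames, "frames in external memory")
print(f"{board_bandwidth / 1e9:.2f} Gbyte/s, "
      f"at most {ideal_frames_per_period:.0f} frames per 33.3 ms")
```

The raw bound comes out at about 650 frames, slightly above the 644 frames quoted in the text; the sketch does not model whatever overheads account for the difference.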

## 4 Performance

Table 2 shows the image processing performance of the IMAP-VISION board. All algorithms are applied to $256 \times 240$ pixel, 8-bit or binary images. The performance figures listed are tens to hundreds of times faster than the video rate. The IMAP-VISION board has very high binary image processing performance, for example 0.045 ms for the $3 \times 3$ binary logical operation, which makes use of the bit-pack operation.

### 4.1 Optical Flow

Optical flow detection using a $5 \times 5$ block gradient-based estimation algorithm[9], applied to every pixel, takes 5.6 ms.

### 4.2 Thinning

Thinning also uses the bit-pack operation. One iteration of thinning takes 0.36 ms. The worst case is a binary image filled with 1's, which takes 21.6 ms. The typical image shown in Figure 8a, whose thinning result is shown in Figure 8b, takes 4.5 ms.

| Algorithm                        | Execution time (ms) | Ratio to<br>33 ms |
|----------------------------------|---------------------|-------------------|
| binarization                     | 0.036          | 0.0011            |
| 3×3 binary logical operation     | 0.045          | 0.0014            |
| 3×3 laplacian filter             | 0.186          | 0.0056            |
| 3×3 convolution                  | 0.65           | 0.0196            |
| 3×3 median filter                | 1.07           | 0.0321            |
| 5×5 gaussian filter              | 0.39           | 0.0117            |
| intensity histogram              | 0.082          | 0.0025            |
| intensity histogram equalization | 0.26           | 0.0078            |
| 90-degree rotation               | 0.51           | 0.0153            |
| rotation                         | 2.5            | 0.075             |
| scaling                          | 1.68           | 0.0504            |
| $5 \times 5$ optical flow        | 5.6            | 0.168             |
| thinning (case of Figure 8)      | 4.5            | 0.135             |
| thinning (worst case)            | 21.6           | 0.648             |
| labeling (case of Figure 9)      | 2.41           | 0.0724            |
| labeling (worst case)            | 19.15          | 0.575             |
| hough transform (1339 points)    | 2.76           | 0.0828            |
| stereo matching (16 disparities) | 5.94           | 0.178             |

Table 2: Image processing performance
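The ratio column of Table 2 can be recomputed from the execution times. The listed ratios appear to correspond to a frame period of 33.3 ms rather than exactly 33 ms, so the sketch below assumes 33.3 ms; this value and the helper name `ratio_to_frame` are our assumptions. Only a few rows are included.

```python
FRAME_PERIOD_MS = 33.3  # assumed video frame period

# (execution time in ms, ratio listed in Table 2) for a few rows
rows = {
    "binarization": (0.036, 0.0011),
    "3x3 median filter": (1.07, 0.0321),
    "5x5 optical flow": (5.6, 0.168),
    "labeling (worst case)": (19.15, 0.575),
    "stereo matching (16 disparities)": (5.94, 0.178),
}


def ratio_to_frame(time_ms, frame_ms=FRAME_PERIOD_MS):
    """Fraction of one video frame period consumed by the algorithm."""
    return time_ms / frame_ms


for name, (time_ms, listed) in rows.items():
    ratio = ratio_to_frame(time_ms)
    print(f"{name}: ratio {ratio:.4f} (listed {listed}), "
          f"about {1 / ratio:.0f} times faster than the video rate")
```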

### 4.3 Labeling

We developed a labeling algorithm in which the input image is binarized and the regions are shrunk using the bit-pack operation. The regions are then traced and labeled by the control processor. The bit-pack operation is also used in this tracing phase to decide the direction to trace. Labeling the image shown in Figure 9, which consists of 136 connected components, takes 2.41 ms. The worst case for this labeling algorithm takes 19.15 ms.

### 4.4 Hough Transform

A Hough transform for 1399 points takes 2.76 ms. Figure 10 shows a road detection example that includes a 1399-point Hough transform. It took 5.01 ms to obtain the result shown in Figure 10b by applying adaptive binarization, an OR operation, edge detection, Hough transform voting, peak detection, and result line drawing.
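For readers unfamiliar with the voting and peak detection steps, a minimal sequential sketch is given below. The paper does not state its parameterization or how the work is distributed over the processor array; the sketch assumes the common normal form $\rho = x \cos\theta + y \sin\theta$ with 1-degree steps, and the function names are illustrative.

```python
import math


def hough_vote(points, width, height, n_theta=180):
    """Accumulate (rho, theta) votes for a list of (x, y) edge points."""
    max_rho = int(math.ceil(math.hypot(width, height)))
    # Row index is rho + max_rho so that negative rho values fit.
    accumulator = [[0] * n_theta for _ in range(2 * max_rho + 1)]
    thetas = [math.pi * k / n_theta for k in range(n_theta)]
    for x, y in points:
        for k, theta in enumerate(thetas):
            rho = int(round(x * math.cos(theta) + y * math.sin(theta)))
            accumulator[rho + max_rho][k] += 1
    return accumulator, max_rho


def strongest_line(accumulator, max_rho, n_theta=180):
    """Return (rho, theta, votes) of the accumulator cell with most votes."""
    votes, row, k = max((v, r, k)
                        for r, cells in enumerate(accumulator)
                        for k, v in enumerate(cells))
    return row - max_rho, math.pi * k / n_theta, votes


# A horizontal line y = 5 should peak at rho = 5, theta = 90 degrees.
points = [(x, 5) for x in range(100)]
acc, max_rho = hough_vote(points, width=100, height=20)
print(strongest_line(acc, max_rho))
```

Each point votes independently of the others, which is what makes the voting step amenable to parallel execution.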

### 4.5 Stereo Matching

Stereo matching with 16 levels of disparity takes 5.94 ms. First, a $3 \times 3$ horizontal differential filter is applied to both the left and right input images to detect vertical edges (0.272 ms). Then, $5 \times 5$ block matching is carried out in 0.354 ms for each of the 16 disparities.
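The total can be checked by adding up the two stages. The sketch assumes that the 0.272 ms covers filtering of both input images, which is the reading under which the parts sum to the stated total; the variable names are ours.

```python
EDGE_FILTER_MS = 0.272      # differential filter, assumed to cover both images
BLOCK_MATCH_MS = 0.354      # 5x5 block matching per disparity
DISPARITIES = 16

total_ms = EDGE_FILTER_MS + DISPARITIES * BLOCK_MATCH_MS
print(f"{total_ms:.3f} ms")  # compare with the 5.94 ms in Table 2
```

Block matching dominates the run time, so under this reading the cost grows almost linearly with the number of disparity levels.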

## 5 Conclusion

The IMAP-VISION chip architecture and the single-board real-time image processing system have been described, together with specification figures such as data transfer bandwidth and memory capacity.

Performance for thinning, optical flow, labeling, Hough transform, stereo matching, and some other algorithms is shown to be much faster than the video rate.

Thus, the IMAP-VISION system is a viable tool for a number of real-time image processing applications.







Figure 9: Labeling. Panel: (a) Input Image.

Figure 10: Hough transform. Panels: (a) Input Image; (b) Result Image.

# References

- [1] T. J. Fountain, et al., "The CLIP7A Image Processor," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol.10, No.3, pp.310-319, May 1988.
- [2] L. A. Schmitt, et al., "The AIS-5000 Parallel Processor," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol.10, No.3, pp.320-330, May 1988.
- [3] D. Juvin, J-L. Basille, H. Essafi, J-Y. Latil, "SYMPATI2: a 1.5 D Processor Array for Image Application," Signal Processing IV: Theories and Applications, Elsevier Science Publishers B.V. (North Holland), 1988.
- [4] Y. Fujita, et al., "Image Functional Memory: Architecture and Performance," Proc. of Workshop on Computer Architecture for Machine Perception, pp.13-24, Dec. 1991.
- [5] Y. Fujita, et al., "A real-time vision system using integrated memory array processor prototype," Machine Vision and Applications, No.7, pp.220-228, 1994.
- [6] N. Yamashita, et al., "A 3.84 GIPS Integrated Memory Array Processor with 64 Processing Elements and a 2-Mb SRAM," IEEE J. of Solid-State Circuits, Vol.29, No.11, pp.1336-1343, Nov. 1994.
- [7] S. Okazaki, et al., "A Compact Real-time Vision System Using Integrated Memory Array Processor Architecture," IEEE Trans. on Circuits and Systems for Video Technology, Vol.5, No.5, pp.446-452, Oct. 1995.
- [8] N. Yamashita, et al., "An Integrated Memory Array Processor with a Synchronous DRAM interface for Real-Time Vision Applications," Proc. of International Conference on Pattern Recognition, Vol.IV, pp.575-580, 1996.
- [9] N. Ohta, "Image Movement Detection with Reliability Indices," IEICE Transactions, Vol.E, No.10, pp.3379-3388, 1991.