A 20-Kbit Associative Memory LSI For AI Machines
Abstract: A 20-kbit (512 words × 40 bits) CMOS associative-memory LSI is described. This LSI performs large-scale parallelism for highly efficient associative operations in artificial intelligence machines. Relational search, large-bit-length data treatment, and quick garbage collection are realized on the single-chip associative-memory LSI. A new cell array structure has been designed in order to reduce the chip area. A newly designed simple accelerator circuit allows for high-speed search operations. This LSI is fabricated using 1.2-µm and double-aluminum-layer CMOS process technology. A total of 284 000 devices have been integrated on a 5.3 × 7.9-mm² chip. The measured minimum cycle time and power dissipation at 10-MHz operation are 85 ns and 250 mW, respectively. The associative memory described here, with its highly efficient associative operation capabilities, promises to be a large step toward the development of high-performance artificial intelligence machines.

I. INTRODUCTION

... system. A high-performance Prolog machine using the 4-kbit CAM LSI was developed experimentally in 1986 [3].

There remain, however, serious problems in broadening CAM LSI applications for artificial intelligence machines. In actual data processing, various search functions, such as less-than and greater-than search, are necessary. Moreover, large-bit-length data cannot be treated at high throughput in conventional CAM LSI's. Furthermore, in order to broaden CAM LSI applications, it is necessary to develop a large-bit-capacity CAM LSI.

In order to resolve these problems, the present study is focused on the following two points. One is the CAM LSI architecture for carrying out various search functions and a large-bit-length data treatment. This large-bit-length data treatment is referred to here as wide-band data processing.
... write-mask data corresponding to the four data parts and eight tag bits. The write-mask and 40-bit search-mask data ...

[Figure residue. Recoverable labels: table columns "Mode" and "Operations"; tag field; stored data, selected data, cell array, and temporary key datum (Fig. 3(a)); flow step "OR-search by TK^i" (Fig. 3(b)); input address (0FF)_16 under the condition where the LSB position of the address is masked (Fig. 4).]
Fig. 3. Bit-serial less-than search: (a) schematic diagram and (b) operation flow. K: original key datum; TK^i: temporary key datum for the ith iterative search operation; k_i: ith bit of the original key datum (i = 0: MSB); V^i: (1...101...1); M^i: (0...01...1).

... processor. This is done simply on the basis of logic operations. In the less-than search operation, the temporary key datum $TK^i$ for the ith iterative search operation is generated by $TK^i = K \cap V^i$, as shown in Fig. 3. Here, $K$ is the original key datum, and $V^i$ is (1...101...1).

At the search operation for the most significant bit, stored datum (0111) is selected, as shown in Fig. 3(a). This result indicates that the stored datum (0111) is less than the original key datum (1010). In this way, the less-than search operation is executed in a serial manner until all possible temporary key data have been compared. The average number of possible temporary key data is 16 for 32-bit-length data. In Fig. 3(a), the final result indicates that the stored data (0111) and (1001) are less than the original key datum (1010).

The CAM LSI "mask data write" operation can be executed in parallel to the other operations, "shift and jump" and "temporary key data generate," under appropriate system configurations (see the Appendix). When the ith bit of the original key datum $k_i$ is ONE, the ith bit-serial operation is completed in one cycle. If $k_i$ is ZERO, the ith bit-serial operation requires two cycles. As a result, a 32-bit less-than search can be executed in an average of 48 cycles. The less-than search execution speed is estimated to be $O(m)$, where $m$ is bit length, because the operation is executed in bit-serial and word-parallel manner. It should be noted that relational search execution speed is not affected by the data volume.

This CAM LSI has been designed to support associative operation on excessive bit-length data of more than 40 bits. The 8-bit tag field is used for the wide-band data processing. Data having bit length exceeding 40 bits can be stored in two or more successive word locations in a folded form by using the tag. The tag indicates the following part of the data. In Fig. 2, up to 256 bits of data are handled in eight successive word locations.

Wide-band data processing is accomplished by iteration of 32-bit data processing and communication of the processing results between neighboring word locations. This communication is executed at all word locations in parallel and performed without any overhead time. The 32-bit data processing, such as equal and relational search, is carried out by referencing the tag field. Consequently, 32 × n bits of large-bit-length data can be processed in k × n cycles for a certain integer k. The integer k depends on the content of the 32-bit data processing. In the wide-band data processing, the number of cycles necessary for execution is not affected by the data volume.

For instance, for 256-bit data where n = 8, the equal search operation is completed in 1 × 8 = 8 cycles, where k = 1, and the less-than search operation is completed in an average of 48 × 8 = 384 cycles, where k = 48. This high throughput and the flexible expandability of the word configuration offer an efficient way to process complicated structural data.
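The bit-serial, word-parallel less-than search described above can be sketched in software. The following is a minimal illustrative model, not the LSI's actual microprogram: the function names, the 4-bit toy width, and the way lower bits are masked in each iteration are assumptions made for the example. The temporary key follows the text ($TK^i = K \cap V^i$, with intersection modeled as bitwise AND), and the cycle count uses the cost quoted in the text (one cycle when $k_i$ is ONE, two when it is ZERO).

```python
from typing import List, Tuple

WIDTH = 4  # toy bit length m; the LSI's data field is 32 bits


def masked_equal_search(stored: List[int], key: int, care: int) -> List[bool]:
    """Word-parallel equal search over the bit positions set in `care`."""
    return [((word ^ key) & care) == 0 for word in stored]


def less_than_search(stored: List[int], key: int,
                     width: int = WIDTH) -> Tuple[List[bool], int]:
    """Bit-serial less-than search.

    Returns per-word flags (True where word < key) and the cycle count
    under the cost model quoted in the text.
    """
    all_ones = (1 << width) - 1
    hits = [False] * len(stored)
    cycles = 0
    for i in range(width):                 # i = 0 is the MSB
        pos = width - 1 - i                # same bit, counted from the LSB
        k_i = (key >> pos) & 1
        cycles += 1 if k_i else 2          # cost model taken from the text
        if not k_i:
            continue                       # no temporary key for a ZERO bit
        v_i = all_ones & ~(1 << pos)       # V^i = (1...101...1)
        tk_i = key & v_i                   # TK^i = K AND V^i
        # Assumption: bits below position i are masked (don't care).
        care = all_ones & ~((1 << pos) - 1)
        found = masked_equal_search(stored, tk_i, care)
        hits = [h or f for h, f in zip(hits, found)]  # OR the partial results
    return hits, cycles


if __name__ == "__main__":
    # (0111) and (1001) come from the text; the other two words are made up.
    words = [0b0111, 0b1001, 0b1010, 0b1100]
    flags, cycles = less_than_search(words, 0b1010)
    print(flags, cycles)
```

With the key (1010), the model flags (0111) in the MSB iteration and (1001) in a later iteration, which agrees with the result described for Fig. 3(a). A 32-bit key containing sixteen ONE bits costs 16 × 1 + 16 × 2 = 48 cycles in this model, matching the quoted average; multiplying by n = 8 folded words gives the 384 cycles quoted for 256-bit data.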
OGURA et al.:20-KBIT ASSOCIATIVE MEMORY LSI 1017
[Figure residue. Recoverable labels: 32-bit data field, divided into 1st part (8 bit) through 4th part (8 bit); 8-bit tag field; EXCLUSIVE-NOR.]

Fig. 5. New CAM cell array: (a) schematic diagram and (b) cell circuit. B: bit line, KD: key-data line, ML: match line, TP: partial-WRITE signal line for tag bit.
Quick garbage collection in the wide-band data processing can be performed by using a garbage flag register, a multiple-response resolver, and a maskable address decoder, as shown in Fig. 4. The garbage flag register is used to indicate whether each word location is reserved or empty. Before a WRITE operation starts, the multiple-response resolver selects one possible word location from among empty word locations in the same way as that of the 4-kbit CAM LSI [10].

In the wide-band data processing, quick garbage collection is carried out by using the address data in a word-serial manner. A word address for a datum with an excessive bit length, which occupies two or more word locations, is generated in response to search operations. The generated address data are sent to a maskable address decoder to identify a set of two or more contiguous word locations. Then, garbage flag registers located in successive words, whose contents should be garbage, are set to be ONE at the same time by the maskable address decoder output, as shown in Fig. 4. In Fig. 4, (0FF)_16 is sent to the maskable address decoder, which masks the LSB position of the address. Therefore, two contiguous word locations (0FE)_16 and (0FF)_16 are identified by the decoder.

The maskable address decoder can be realized by setting $A_i$ and $\overline{A_i}$ at a high level, where $A_i$ is a masked address bit. In this CAM LSI, the lower three bits of input address are maskable. Therefore, two, four, and eight contiguous word locations can be made garbage at the same time. In addition, the maskable address decoder is used as a conventional word address decoder under nonmaskable conditions.

A. CAM Cell Array

In order to develop a large-bit-capacity CAM LSI and to realize a partial-WRITE operation, a new CAM cell array structure has been designed for the 20-kbit CAM LSI, as shown in Fig. 5. This CAM cell array consists of associative-memory cells, word-line drivers, and a multi-way switching box.

The associative-memory cell circuit is composed of seven/nine n-MOS transistors and two high-resistive poly-Si load devices, as shown in Fig. 5(b). In contrast to a conventional transistor-load cell circuit, this high-resistive poly-Si load cell circuit allows the parallel-WRITE operation for multiple word locations at the same time. The capability of a parallel-WRITE operation is essential for a CAM LSI, as pointed out in [10]. A search operation is achieved by detecting whether a precharged match line has discharged through the cell or not.

This cell circuit has two merits. One merit is the reduction of the number of transistors that constitute an EXCLUSIVE NOR circuit. In this cell circuit, the EXCLUSIVE NOR circuit is made up of three transistors. The conventional EXCLUSIVE NOR circuit is composed of four transistors [11]. The other merit is small stray capacitance of the match line because only one transistor is connected to the match line in the cell circuit. On the other hand, a drawback is slow discharging of the match line through the cell circuit. The gate voltage of the cell transistor discharging the match line is about $V_{CC} - 2V_T$, where $V_T$ is threshold voltage including the backgate bias effect, and $V_{CC}$ is the supply voltage. In order to overcome this demerit, a simple accelerator circuit for discharging the match line is applied (see Section III-B).

Fig. 7. Speed-up time $\Delta T$ dependence on stray capacitance of a match line $C_M$.
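Returning to the quick garbage collection scheme, the behavior of the maskable address decoder, the garbage flag register, and the multiple-response resolver can be modeled in a few lines. This is a hedged sketch only: the function names are hypothetical, a garbage flag of ONE is taken to mean "empty" as the text suggests, and the lowest-address-first choice in the resolver is an assumption (the text says only that one empty word location is selected).

```python
from typing import List, Optional

NUM_WORDS = 512        # word locations in the 20-kbit CAM LSI
MAX_MASKED_BITS = 3    # the lower three address bits are maskable


def decode(address: int, masked_bits: int = 0) -> List[int]:
    """Return the word locations selected when the low `masked_bits`
    address bits are treated as don't-care."""
    if not 0 <= masked_bits <= MAX_MASKED_BITS:
        raise ValueError("only the lower three address bits are maskable")
    if not 0 <= address < NUM_WORDS:
        raise ValueError("address out of range")
    base = address & ~((1 << masked_bits) - 1)
    return [base + offset for offset in range(1 << masked_bits)]


def mark_garbage(flags: List[bool], address: int, masked_bits: int) -> None:
    """Set the garbage flag (True = ONE = empty) of every selected word.
    The hardware does this in one step; the loop is only a software model."""
    for word in decode(address, masked_bits):
        flags[word] = True


def resolve_empty(flags: List[bool]) -> Optional[int]:
    """Pick one empty word location (lowest address first: an assumption)."""
    for word, is_empty in enumerate(flags):
        if is_empty:
            return word
    return None


if __name__ == "__main__":
    flags = [False] * NUM_WORDS               # all words reserved
    mark_garbage(flags, 0x0FF, masked_bits=1)  # mask the LSB, as in Fig. 4
    print([hex(w) for w in decode(0x0FF, 1)], resolve_empty(flags))
```

Masking the LSB of (0FF)_16 selects (0FE)_16 and (0FF)_16, as in the Fig. 4 example. Masking two or three bits selects four or eight contiguous words, and masking none reduces the model to a conventional address decoder.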
In this CAM cell array structure, the match line is also used to identify data written/read word locations. The partial-WRITE operation into arbitrary data parts is carried out by using the word-line drivers and the match line. No additional horizontal signal line is necessary, as shown in Fig. 5. Therefore, this CAM cell array structure is suitable for a large-bit-capacity CAM LSI.

In a search operation, the word lines are isolated electrically from the match line and set at a low level by setting signals $AP_i$ (i = 0-3) and RW at a low level. Therefore, a search result propagates to the word operation block through the multi-way switching box. Here, signals $AP_i$ and RW control whether the word line is isolated electrically from the match line or not.

In a data READ/WRITE operation, all key-data lines KD and $\overline{KD}$ are set at a low level to cut the current path from a match line to the ground. A word drive signal is sent to the match line through the multi-way switching box. In a READ operation, all signals $AP_i$ and RW are set at a high level to connect the match line and word lines. Signals $TP_j$ (j = 0-7) are also set at a high level to activate the tag bit cells. In a WRITE operation, RW is set at a high level, and each $AP_i$ and $TP_j$ are set at a low/high level corresponding to write-mask data. Writing into data parts and tag bits, where $AP_i$ and $TP_j$ are set at a low level, is prohibited.

This new cell array structure has two merits: one is cell array area reduction and the other is word-line delay reduction. The cell array area is reduced by 25 percent as compared with the structure in which a horizontal signal line is added to identify word locations, which is identical to a double-word-line structure in a static RAM [14]. The word-line delay reduction is achieved because each word line, which has small line capacity, is driven in parallel.

B. Accelerator Circuit for Discharging Match Line

A simple accelerator circuit for discharging the match line is designed as shown in Fig. 6. A search result is detected by whether a precharged match line is discharged or not. Therefore, quick discharging of the match line is necessary to achieve a high-speed search operation.

The accelerator circuit is composed of four transistors. The match and output lines are precharged by setting L[illegible] and $L_A$ at a low level. A search operation starts in L[illegible] = $L_A$ = ONE. In the case of mismatch, the charge of the match line begins to flow through (a) cell transistor(s). Then, the match-line potential $V_M$ falls exponentially. When $V_M$ reaches the threshold voltage of transistor $T_{A2}$, the transistor goes to the ON condition. The node A potential $V_A$ becomes higher by charging up through $T_{A2}$, then $T_{A3}$ goes to the ON condition. As a result, an additional current path through $T_{A3}$ and $T_{A4}$ is generated, and the discharging of the match line is accelerated. If the cut transistor $T_C$ is inserted, as shown in Fig. 6, the match line and the output line are cut electrically on the way. The stray capacitance of the output line is smaller than that of the match line. Therefore, the output line potential falls more quickly. In order to estimate the speed-up time, the dependence of the speed-up time $\Delta T$ on stray capacitance $C_M$ of a match line is calculated using a MOS circuit simulator.

The calculated results are shown in Fig. 7. Since the $C_M$ of the match line is estimated at 0.3 pF, irrespective of the cut transistor, $\Delta T$ is about 8 ns, which is 30 percent of total search operation time. The accelerator circuit without a cut transistor was adopted for this CAM LSI. When the $C_M$ becomes larger, the cut transistor is more effective. This simple accelerator circuit contributes to realization of a high-speed search operation.

IV. FABRICATED LSI

The 20-kbit CAM LSI was fabricated using a 1.2-µm CMOS process technology with double-aluminum layers for interconnection. A photomicrograph of the LSI is shown in Fig. 8. A total of 284 000 devices have been integrated on a 5.3 × 7.9-mm² chip. The newly designed cell for data bit occupies 578 µm².

An example of the I/O signal waveforms is shown in Fig. 9. Fig. 9 shows the timing relationships between clock, data, address, and search-response signals. The search-response signal indicates whether a matched word exists. Waveforms are observed in a sequence of search-mask-data WRITE, search, and READ operations. The waveforms indicate that the search-response signal changes to ONE in a search-operation cycle, and the searched-out data and address ...
[Figure residue. Recoverable block-diagram labels: Host, Data-Bus, Address-Bus, Control, Flag.]
TABLE II
FEATURES OF THE 20-KBIT CAM LSI

Configuration            512 words × 40 bits
Instruction set          26 instructions
Cycle time               85 ns (minimum)
Supply voltage           5 V
Power dissipation        250 mW at 10-MHz operation
I/O interface            I/O common, TTL compatible
Number of pins           66
Package                  [?]-pin PGA
LSI process technology   1.2-µm CMOS with double aluminum layers
Number of devices        284,000
Chip size                5.3 mm × 7.9 mm

A new CAM cell array structure was designed in order to reduce the chip area. A total of 284 000 devices have been integrated on a 5.3 × 7.9-mm² chip using 1.2-µm and double-aluminum-layer CMOS process technology. A newly designed simple accelerator circuit allows for high-speed search operations. The measured minimum cycle time and power dissipation were 85 ns and 250 mW (at 10 MHz), respectively. This associative memory, with its proven highly efficient associative operation capabilities, is expected to contribute significantly to the development of high-performance artificial intelligence machines.

APPENDIX
TYPICAL SYSTEM CONFIGURATION
... handling, as shown in Fig. 3(b), and generates several flags for sequence control.

ACKNOWLEDGMENT

The authors wish to thank Dr. T. Sudo, Dr. N. Ieda, Dr. T. Nakashima, and Dr. T. Kimura for supporting this work. The authors also wish to express their indebtedness to J. Naganuma for the design of this LSI.

... Osaka University, Osaka, Japan, in 1976 and 1978, respectively. In 1978 he joined the Musashino Electrical Communication Laboratory, Nippon Telegraph and Telephone Public Corporation (NTT), Tokyo, Japan. He is now with NTT LSI Laboratories, Kanagawa, Japan. He is interested in logic-in-memory LSI's and their application.
Mr. Ogura is a member of the Institute of Electronics, Information and Communication Engineers of Japan, the Information Processing Society of Japan, and the IEEE Computer Society.