
TCAM-based Distributed Parallel Packet Classification Algorithm with Range-Matching Solution*

Kai Zheng¹, Hao Che², Zhijun Wang², Bin Liu¹

¹The Department of Computer Science, Tsinghua University, Beijing, P.R. China 100084
²The Department of Computer Science and Engineering, The University of Texas at Arlington, Arlington, TX 76019, USA
zk01@mails.tsinghua.edu.cn, {hche, zwang}@cse.uta.edu, liub@tsinghua.edu.cn

Abstract—Packet Classification (PC) has been a critical data path function for many emerging networking applications. An interesting approach is the use of TCAM to achieve deterministic, high speed PC. However, apart from high cost and power consumption, due to the slow growing clock rate for memory technology in general, PC based on the traditional single-TCAM solution has difficulty keeping up with fast growing line rates. Moreover, the TCAM storage efficiency is largely affected by the need to support rules with ranges, or range matching. In this paper, a distributed TCAM scheme that exploits chip-level parallelism is proposed to greatly improve the PC throughput. This scheme seamlessly integrates with a range encoding scheme, which not only solves the range matching problem but also ensures a balanced high throughput performance. Using commercially available TCAM chips, the proposed scheme achieves PC performance of more than 100 million packets per second (Mpps), matching OC768 (40 Gbps) line rate.

Keywords—System Design, Simulations

I. INTRODUCTION

Packet Classification (PC) has wide applications in networking devices to support firewall, access control list (ACL), and quality of service (QoS) functions in access, edge, and/or core networks. PC involves various matching conditions, e.g., longest prefix matching (LPM), exact matching, and range matching, making it a complicated pattern matching issue. Moreover, since PC lies in the critical data path of a router and has to act upon each and every packet at wire-speed, it creates a potential bottleneck in the router data path, particularly for high speed interfaces. For example, at OC192 (10 Gbps) full line rate, a line card (LC) needs to process about 25 million packets per second (Mpps) in the worst case, when minimum sized packets (40 bytes each) arrive back-to-back. As the aggregate line rate to be supported by an LC moves towards OC768, it poses significant challenges for the design of packet classifiers to allow wire-speed forwarding.

The existing algorithmic approaches, including geometric algorithms based on the hierarchical trie [1][2][3][4] and most heuristic algorithms [6][7][8][9], generally require a nondeterministic number of memory accesses for each lookup, which makes it difficult to use pipelining to hide the memory access latency, limiting the throughput performance. Moreover, most algorithmic approaches, e.g., geometric algorithms, apply only to 2-dimensional cases. Although some heuristic algorithms address higher dimensional cases, they offer nondeterministic performance, which differs from one case to another.

In contrast, ternary content addressable memory (TCAM) based solutions are more viable to match high speed line rates, while making software design fairly simple. A TCAM finds a matched rule in O(1) clock cycles and therefore offers the highest possible lookup/matching performance. However, despite its superior performance, it is still a challenge for a TCAM based solution to match OC192 to OC768 line rates. For example, a TCAM with a 100 MHz clock rate can perform 100 million (M) TCAM lookups per second. Since each typical 5-tuple policy table matching requires two TCAM lookups, as will be explained in detail later, the TCAM throughput for 5-tuple matching is 50 Mpps. As aforementioned, to keep up with the OC192 line rate, PC has to keep up with a 25 Mpps lookup rate, which translates into a budget of two 5-tuple matches per packet. The budget reduces to 0.5 matches per packet at OC768. Apparently, with LPM and firewall/ACL competing for the same TCAM resources, a single 100 MHz TCAM would be insufficient for PC while maintaining OC192 to OC768 line rates. Although increasing the TCAM clock rate can improve the performance, it is unlikely that a TCAM technology matching the OC768 line speed will be available anytime soon, given that memory speed improves by only 7% each year [17].

Instead of striving to reduce the access latency of a single TCAM, a more effective approach is to exploit chip-level parallelism (CLP) to improve overall PC throughput performance. However, a naive approach to realize CLP by simply duplicating the databases to a set of uncoordinated TCAM chips can be costly, given that TCAM is an expensive commodity. In a previous work [16] by two of the authors of the present work, it was demonstrated that by making use of the structure of IPv4 route prefixes, a multi-TCAM solution that exploits CLP can actually achieve high throughput performance gain in supporting LPM with low memory cost.

Another important benefit of using TCAM CLP for PC is its ability to effectively solve the range matching problem. [10] reported that today's real-world policy filtering (PF) tables involve significant percentages of rules with ranges. Supporting rules with ranges, or range matching, in TCAM can lead to very low TCAM storage efficiency, e.g., 16% as reported in [10]. [10] proposed an extended TCAM scheme to improve the TCAM

* This research is supported by the NSFC (No. 60173009 & No. 60373007) and the National 863 High-tech Plan (No. 2003AA115110 & No. 2002AA103011-1).

0-7803-8968-9/05/$20.00 (C) 2005 IEEE
storage efficiency, in which a TCAM hierarchy and circuits for range comparisons are introduced. Another widely adopted solution to deal with range matching is to do range preprocessing/encoding by mapping ranges to a short sequence of encoded bits, known as bit-mapping [11]. Applications of bit-map based range encoding for packet classification using a TCAM were also reported in [11][12][13][14][15]. A key challenge for range encoding is the need to encode multiple sub-fields in a search key extracted from the packet to be classified at wire-speed. To achieve high speed search key encoding, parallel search key sub-field encoding was proposed in [11][13], which, however, assumes the availability of multiple processors and multiple memories for the encoding. To ensure the applicability of the range encoding scheme to any commercial network processors and TCAM coprocessors, the authors of this paper proposed to use the TCAM itself for sequential range encoding [15], which, however, reduces the TCAM throughput performance. Using TCAM CLP for range encoding provides a natural solution which solves the performance issue encountered in [15].

However, extending the idea in [16] to allow TCAM CLP for general PC is a nontrivial task for the following two reasons: 1) the structure of a general policy rule, such as a 5-tuple rule, is much more complex than that of a route, and it does not follow a simple structure like a prefix; 2) it involves three different matching conditions, including prefix, range, and exact matches.

In this paper, we propose an efficient TCAM CLP scheme, called Distributed Parallel PC with Range Encoding (DPPC-RE), for the typical 5-tuple PC. First, a rule database partitioning algorithm is designed to allow different partitioned rule groups to be distributed to different TCAMs with minimum redundancy. Then a greedy heuristic algorithm is proposed to evenly balance the traffic load and storage demand among all the TCAMs. On the basis of these algorithms and combined with the range encoding ideas in [15], both a static algorithm and a fully adaptive algorithm are proposed to deal with range encoding and load balancing simultaneously. The simulation results show that the proposed solution can achieve 100 Mpps throughput performance, matching OC768 line rate, with just 50% additional TCAM resource compared with a single TCAM solution at about 25 Mpps throughput performance.

The rest of the paper is organized as follows. Section II gives the definitions and theorems which will be used throughout the paper. Section III presents the ideas and algorithms of the DPPC-RE scheme. Section IV presents the implementation details on how to realize DPPC-RE. The performance evaluation of the proposed solution is given in Section V. Finally, Section VI concludes the paper.

II. DEFINITIONS AND THEOREMS

Rules: A rule table or filtering table includes a set of match conditions and their corresponding actions. We consider the typical 104-bit five-tuple match conditions, i.e., (SIP(1-32), DIP(1-32), SPORT(1-16), DPORT(1-16), PROT(1-8)), where SIP, DIP, SPORT, DPORT, and PROT represent source IP address, destination IP address, source port, destination port, and protocol number, respectively. DIP and SIP require longest prefix matching (LPM); SPORT and DPORT generally require range matching; and PROT requires exact matching. Except for sub-fields with range matching, any other sub-field in a match condition can be expressed using a single string of ternary bits, i.e., 0, 1, or "don't care" (*).¹ Table I gives an example of a typical five-tuple rule table.

TABLE I. An Example of a Rule Table with 5-tuple Rules
(Columns: SrcIP, DstIP, SrcPort, DstPort, Prot, Action. The five example rules L1-L5 combine prefixes such as 1.1.*.* and 2.2.2.2, port ranges such as 256-512, 5000-6000, >1023 and <1023, exact protocol values such as 6 and 11, wildcards, and actions such as Accepted, AF, and EF.)

¹ The bits in the sub-field are ordered with the 1st bit (MSB) in the leftmost position.
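As a small illustration of the ternary representation just described (a sketch of ours; the helper names are not from the paper), a prefix sub-field such as 1.1.*.* becomes its 16 defined bits followed by 16 "don't care" symbols, and a key matches a condition bit by bit:

```python
def prefix_to_ternary(prefix_bits: str, field_width: int) -> str:
    """Pad the defined prefix bits with '*' (don't care) to the sub-field width."""
    return prefix_bits + "*" * (field_width - len(prefix_bits))

def ternary_match(key_bits: str, cond: str) -> bool:
    """A key matches a condition iff every bit is equal or the condition bit is '*'."""
    return all(c == "*" or k == c for k, c in zip(key_bits, cond))

def ip_bits(a, b, c, d):
    """32-bit string for a dotted-quad IP address (1st bit = MSB)."""
    return "".join(format(x, "08b") for x in (a, b, c, d))

# SIP prefix 1.1.*.* -> 16 defined bits followed by 16 don't-care bits
sip_cond = prefix_to_ternary(ip_bits(1, 1, 0, 0)[:16], 32)
print(ternary_match(ip_bits(1, 1, 1, 1), sip_cond))   # True
print(ternary_match(ip_bits(2, 2, 2, 2), sip_cond))   # False
```

A range such as SrcPort 256-512, by contrast, has no single such string, which is why range matching needs the separate treatment discussed later.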
A rule with encoded ranges only takes one rule entry, thus significantly improving TCAM storage efficiency. In the following, let N₀ denote the rule table size, i.e., the number of rules in a rule table; let N represent the number of TCAM entries required to accommodate the rule table without range encoding; and let Ñ stand for the number of TCAM entries required to accommodate the rule table with range encoding.

Search Key: A search key is a 104-bit binary string composed of a five-tuple. For example, <1.1.1.1, 2.2.2.2, 1028, 34556, 11> is a five-tuple search key. In general, a search key is extracted by a network processor from the IP header and passed to a packet classifier to match against a five-tuple rule table.

Matching: In the context of TCAM based PC, as is the case in this paper, matching refers to ternary matching in the following sense: a search key is said to match a particular match condition if, for each and every corresponding bit position in both the search key and the match condition, either of the following two conditions is met: (1) the bit values are identical; (2) the bit in the match condition is "don't care" or *.

So far, we have defined the basic terminologies for rule matching. Now we establish some important concepts upon which the distributed TCAM PC is developed.

ID: The idea of the proposed distributed TCAM PC is to make use of a small number of bit values extracted from certain bit positions in the search key and match condition as IDs to (1) divide match conditions or rules into groups, which are mapped to different TCAMs; and (2) direct a search key to a specific TCAM for rule matching.

In this paper, we use P bits picked from given bit positions in the DIP, SIP, and/or PROT sub-fields of a match condition as the rule ID, denoted as Rule-ID, for the match condition, and use P bits extracted from the corresponding search key positions as the key ID, denoted as Key-ID, for the search key. For example, suppose P = 4 and the bits are extracted from SIP(1), DIP(7), DIP(16) and PROT(8). Then the Rule-ID for the match condition <1.1.*.*, 2.*.*.*, *, *, 6> is "01*0" and the Key-ID for the search key <1.1.1.1, 2.2.2.2, 1028, 34556, 11> is "0101".

ID Groups: We define all the match conditions having the same Rule-ID as a Rule-ID group. Since a Rule-ID is composed of P ternary bits, the match conditions or rules are classified into 3^P Rule-ID groups. If "*" is replaced with "2", we get a ternary value for the Rule-ID, which uniquely identifies the Rule-ID group (note that the numerical values for different Rule-IDs are different). Let RID_j be the Rule-ID with value j and RG_j represent the Rule-ID group with Rule-ID value j. For example, for P = 4, the Rule-ID group with Rule-ID "00*1" is RG_7, since the Rule-ID value is j = (0021)_3 = 7.

Accordingly, we define the set of all the Rule-ID groups whose Rule-IDs match a given Key-ID as a Key-ID group. Since each Key-ID is a binary value, we use this value to uniquely identify the Key-ID group. In parallel to the definitions for Rule-ID, we define KID_i as the Key-ID with value i and KG_i as the corresponding Key-ID group. We have a total number of 2^P Key-ID groups.

With the above definitions, we have

KG_i = ∪_{RID_j matches KID_i} RG_j.

For example, for P = 3, the Key-ID group "011" is composed of the following Rule-ID groups: "011, *11, 0*1, 01*, **1, *1*, 0**, ***".

An immediate observation is that different Key-ID groups may overlap with one another, in the sense that different Key-ID groups may have common Rule-ID groups.

Distributed Storage Expansion Ratio: Since Key-ID groups may overlap with one another, we have

Σ_{i=0}^{2^P − 1} |KG_i| ≥ N,

where |A| represents the number of elements in set A. In other words, using Key-IDs to partition rules and distribute them to different TCAMs introduces redundancy. To formally characterize this effect, we further define the Distributed Storage Expansion Ratio (DER) as DER = D(N,K)/N, where D(N,K) represents the total number of rules required to accommodate N rules when the rules are distributed to K different TCAMs. Here DER characterizes the redundancy introduced by the distributed storage of rules, with or without range encoding.

Throughput and Traffic Intensity: In this paper, we use throughput, traffic intensity, and throughput ratio as performance measures of the proposed solution. Throughput is defined as the number of PCs per unit time. It is an important measure of the processing power of the proposed solution. Traffic intensity is used to characterize the workload in the system. As the design is targeted at PC at OC768 line rate, we define traffic intensity as the ratio between the actual traffic load and the worst-case traffic load at OC768 line rate, i.e., 100 Mpps. Throughput ratio is defined as the ratio between the throughput and the worst-case traffic load at OC768 line rate.

Now, two theorems are established, which state under what conditions the proposed solution ensures correct rule matching and maintains the original ordering of the packets, respectively.

Theorem 1: For each PC, correct rule matching is guaranteed if
a) All the rules belonging to the same Key-ID group are placed in the same TCAM with correct priority orders;
b) A search key containing a given Key-ID is matched against the rules in the TCAM in which the corresponding Key-ID group is placed.

Proof: On the one hand, a necessary condition for a given search key to match a rule is that the Rule-ID of this rule matches the Key-ID of the search key. On the other hand, any rule that does not belong to this Key-ID group cannot match the search key, because the Key-ID group contains all the rules that match the Key-ID. Hence, a rule match can occur only between the search key and the rules belonging to the Key-ID group corresponding to the search key. As a result, meeting conditions a) and b) guarantees correct rule matching.

Theorem 2: The original packet ordering for any given application flow is maintained if packets with the same Key-ID are processed in order.

Proof: First, note that packet ordering should be maintained only for packets belonging to the same application flow, and an application flow is in general identified by the five-tuple. Second, note that packets from a given application flow must have the same Key-ID by definition. Hence, the original packet ordering for any given application flow is maintained if packets with the same Key-ID are processed in order.
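A minimal sketch (ours, not from the paper) makes the grouping concrete: KG_i collects every Rule-ID pattern that ternary-matches Key-ID i, reproducing the P = 3 example above in which Key-ID "011" covers eight Rule-ID groups:

```python
from itertools import product

def all_rule_ids(P: int):
    """All 3**P ternary Rule-ID patterns over P ID bits."""
    return ("".join(t) for t in product("01*", repeat=P))

def rid_matches_kid(rule_id: str, key_id: str) -> bool:
    """A Rule-ID matches a Key-ID iff each bit is equal or is '*'."""
    return all(r == "*" or r == k for r, k in zip(rule_id, key_id))

def key_id_group(key_id: str):
    """KG_i: the set of Rule-ID groups whose Rule-IDs match Key-ID i."""
    return {r for r in all_rule_ids(len(key_id)) if rid_matches_kid(r, key_id)}

print(sorted(key_id_group("011")))
# eight patterns: {'011', '01*', '0*1', '0**', '*11', '*1*', '**1', '***'}
```

Each of the P positions of a matching Rule-ID is either the key bit itself or '*', which is why every Key-ID group contains exactly 2^P of the 3^P Rule-ID groups, and why distinct Key-ID groups share the wildcarded ones.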
III. ALGORITHMS AND SOLUTIONS

The key problems we aim to solve are 1) how to make use of CLP to achieve high performance with minimum cost; and 2) how to solve the TCAM range matching issue to improve the TCAM storage efficiency (consequently controlling the cost and power consumption). A scheme called Distributed Parallel Packet Classification with Range Encoding (DPPC-RE) is proposed.

The idea of DPPC is the following. First, by appropriately selecting the ID bits, a large rule table is partitioned into several Key-ID groups of similar sizes. Second, by applying certain load-balancing and storage-balancing heuristics, the rules (Key-ID groups) are distributed evenly to several TCAM chips. As a result, multiple packet classifications corresponding to different Key-ID groups can be performed simultaneously, which significantly improves PC throughput performance without incurring much additional cost.

The idea of RE is to encode the range sub-fields of the rules and the corresponding sub-fields in a search key into bit-vectors, respectively. In this way, the number of ternary strings (or TCAM entries, which will be defined shortly in Section III.C) required to express a rule with non-trivial ranges can be significantly reduced (e.g., to only one string), improving TCAM storage efficiency. In DPPC-RE, the TCAM chips that are used to perform rule matching are also used to perform search key encoding. This not only offers a natural way to perform parallel search key encoding, but also makes it possible to develop efficient load-balancing schemes, making DPPC-RE indeed a practical solution. In what follows, we introduce DPPC-RE in detail.

A. ID Bits Selection

The objective of ID-bit selection is to minimize the number of redundant rules (introduced due to the overlapping among Key-ID groups) and to balance the sizes of the Key-ID groups (a large discrepancy among the Key-ID group sizes may result in low TCAM storage utilization).

A brute-force approach to solve the above optimization problem would be to traverse all the P-bit combinations out of the W-bit rules to get the best solution. However, since the value of W is relatively large (104 bits for the typical 5-tuple rules), the complexity is generally too high to do so. Hence, we introduce a series of empirical rules, based on the analyses of the 5 real-world databases [18] that are used throughout the rest of the paper, to simplify the computation as follows:

1) Since the sub-fields DPORT and SPORT in a rule may have non-trivial ranges which need to be encoded, we choose not to take these two sub-fields into account for ID-bit selection;

2) According to the analysis of several real-world rule databases [18], over 70% of the rules have a non-wildcarded PROT sub-field, and over 95% of these non-wildcarded PROT sub-fields are either TCP(6) or UDP(11) (approximately 50% are TCP). Hence, one may select either the 8th or the 5th bit (TCP and UDP PROTs have different values at these two bit positions) of the PROT sub-field as one of the ID bits. All the rest of the bits in the PROT sub-field have a fixed one-to-one mapping relationship with the 8th or 5th bit, and do not provide any new information about the PROT;

3) Note that the rules with wildcard(s) in their Rule-IDs are actually those incurring redundant storage. The more wildcards a rule has in its Rule-ID, the more Key-ID groups it belongs to and, consequently, the more redundant storage it incurs. In the 5 real-world rule databases, over 92% of the rules have DIP sub-fields that are prefixes no longer than 25 bits, and over 90% have SIP sub-fields that are prefixes no longer than 25 bits. So we choose not to use the last 7 bits (i.e., the 26th to 32nd bits) of these two sub-fields, since they are wildcards in most cases.

Based on these 3 empirical rules, the traversal is simplified as: choose an optimal (P-1)-bit combination out of the 50 bits of the DIP and SIP sub-fields (DIP(1-25), SIP(1-25)), and then combine these (P-1) bits with PROT(8) or PROT(5) to form the P-bit ID.

Fig. 2 shows an example of the ID-bit selection for Database #5 [18] (with 1550 rules in total). We use an equally weighted sum of two objectives, i.e., the minimization of the variance among the sizes of the Key-ID groups and of the total number of redundant rules, to find the 4-bit combination: PROT(5), DIP(1), DIP(21) and SIP(4).²

We find that, although the sizes of the Rule-ID groups are unbalanced, the sizes of the Key-ID groups are quite similar, which allows memory-efficient schemes to be developed for the distribution of rules to TCAMs.

Fig. 2 ID-bit Selection Result of Rule Database Set #5.

² The leftmost bit is the least significant.

B. Distributed Table Construction

The next step is to evenly distribute the Key-ID groups to K TCAM chips and to balance the classification load among the TCAM chips. For clarity, we first describe the mathematical model of the distributed table construction problem as follows. Let:

Q_k be the set of the Key-ID groups placed in TCAM #k, where k = 1, 2, ..., K;

W[j], j = 0, 1, ..., 2^P − 1, be the frequency of KID_j appearances in the search keys, indicating the rule-matching load ratio of Key-ID group KG_j;

RM[k] be the rule-matching load ratio that is assigned to TCAM #k;
G[k] be the number of rules distributed to TCAM #k, namely G[k] = |∪_{KG_i ∈ Q_k} KG_i|;

C_t be the capacity tolerance of the TCAM chips (the maximum number of rules a chip can contain), and L_t be the tolerance (maximum value) of the traffic load ratio that a TCAM chip is expected to bear.

The optimization problem for distributed table construction is given by:

Find a K-division {Q_k, k = 1, ..., K} of the Key-ID groups that

Minimize: Max_{k=1,...,K} G[k] and Max_{k=1,...,K} RM[k]

Subject to: G[k] ≤ C_t and RM[k] ≤ L_t, k = 1, ..., K.

Consider each Key-ID group as an object and each TCAM chip as a knapsack. We find that the problem is actually a variant of the Weight-Knapsack problem, which can be proved to be NP-hard.

Note that the problem has multiple objectives, which cannot be handled by conventional greedy methods. In what follows, we first develop two heuristic algorithms, each taking one of the two objectives as a constraint and optimizing the other. Then the two algorithms are run to get two solutions, respectively, and the better one is finally chosen.

Capacity First Algorithm (CFA): The objective Max_k RM[k] is regarded as a constraint. In this algorithm, the Key-ID groups with relatively more rules are distributed first. In each round, the current Key-ID group is assigned, under the load constraint, to the TCAM with the least number of rules.

I) Sort {i, i = 1, 2, ..., 2^P} in decreasing order of |KG_i|, and record the result as <Kid[1], Kid[2], ..., Kid[2^P]>;
II) for i from 1 to 2^P do
      Sort {k, k = 1, ..., K} in increasing order of G[k], and record the result as <Sc[1], Sc[2], ..., Sc[K]>;
      for k from 1 to K do
        if RM[Sc[k]] + W[Kid[i]] ≤ L_t then
          Q_{Sc[k]} = Q_{Sc[k]} ∪ KG_{Kid[i]};
          G[Sc[k]] = |Q_{Sc[k]}|;
          RM[Sc[k]] = RM[Sc[k]] + W[Kid[i]];
          break;
III) Output {Q_k, k = 1, ..., K} and {RM[k], k = 1, ..., K}.

Load First Algorithm (LFA): In this algorithm, the capacity objective is regarded as a constraint. The Key-ID groups with relatively larger traffic load ratios are assigned first, and the TCAM chips with lower load are chosen first.

I) Sort {i, i = 1, 2, ..., 2^P} in decreasing order of W[i], and record the result as <Kid[1], Kid[2], ..., Kid[2^P]>;
II) for i from 1 to 2^P do
      Sort {k, k = 1, ..., K} in increasing order of RM[k], and record the result as <Sc[1], Sc[2], ..., Sc[K]>;
      for k from 1 to K do
        if |Q_{Sc[k]} ∪ KG_{Kid[i]}| ≤ C_t then
          Q_{Sc[k]} = Q_{Sc[k]} ∪ KG_{Kid[i]};
          G[Sc[k]] = |Q_{Sc[k]}|;
          RM[Sc[k]] = RM[Sc[k]] + W[Kid[i]];
          break;
III) Output {Q_k, k = 1, ..., K} and {RM[k], k = 1, ..., K}.

The Distributed Table Construction Scheme: The two algorithms may not find a feasible solution with a given L_t value. Hence, they are run iteratively, relaxing L_t in each iteration, until a feasible solution is found. In a given iteration, if only one of the two algorithms finds a feasible solution, this solution is the final one. If both algorithms find feasible solutions, one of them is chosen according to the following rules. Suppose that G_A and RM_A are the two objectives given by algorithm A (CFA or LFA), while G_B and RM_B are the two objectives given by the other algorithm B (LFA or CFA).

1) If G_A < G_B and RM_A < RM_B, we choose the solution given by algorithm A;
2) If G_A < G_B but RM_A > RM_B, we choose the solution given by algorithm A when RM_A < 2/K (the reason will be revealed shortly in Section III.D); otherwise we choose the solution given by algorithm B.

The corresponding processing flow is depicted in Fig. 3.

Fig. 3 Distributed Table Construction Flow.

We still use rule database set #5 as an example. Suppose that the traffic load distribution among the Key-ID groups is as depicted in Fig. 4, which is intentionally selected to have a large variance, to create a difficult case for load-balancing.
Note that the ID-bits are PROT(5), DIP(1), DIP(21), and SIP(4), as obtained in the last subsection. Given the constraints C_t = 600 and L_t = 30%, the results for K=4 and K=5 are shown in Tables II and III, respectively.

Fig. 4 Traffic Load Distribution among the Key-ID Groups.

For K=5, CFA produces a better result (both objectives are better) than LFA, as shown in Table II. We find that the numbers of rules assigned to the different TCAMs are very close to one another. The Distributed Storage Expansion Ratio (DER) is 1.51, which means that only about 50% more TCAM entries are required (note that 200% (at K=2) or more would be required if the rule table were duplicated and assigned to each TCAM). The maximum traffic load ratio is 29.4% < 2/K = 40%. As we shall see soon, using the load-balancing schemes proposed in Section III.D, this kind of traffic distribution can be perfectly balanced.

For K=4, LFA instead produces a better result than CFA, and the maximum and minimum traffic load ratios are 23.9% and 23.5%, respectively, very close to a perfectly balanced load.

TABLE II. When K=5, CFA Gives the Best Result; No Iteration Is Needed.
(Columns: Key-ID Groups (Table Contents), Number of Rule-ID Groups, Number of Rules, Traffic Load.)

C. Solutions for Range Matching

Range matching is a critical issue for the effective use of TCAM for PC. The real-world databases in [10] showed that the TCAM storage efficiency can be as low as 16% due to the existence of a large number of rules with ranges. We apply our earlier proposed Dynamic Range Encoding Scheme (DRES) [15] to the distributed TCAMs, in order to improve the TCAM storage efficiency.

DRES [15] makes use of the free bits in each rule entry to encode a subset of ranges selected from any rule sub-field with ranges. An encoded range is mapped to a code vector implemented using the free bits, and the corresponding sub-field is wildcarded. Hence, a rule with encoded ranges can be implemented in one rule entry, reducing the TCAM storage usage. To match an encoded rule, a search key is preprocessed to generate an encoded search key. This preprocess is called search Key Encoding (KE). Accordingly, the PC process in a TCAM with range encoding includes two steps: search KE and Rule Matching (RM). DRES uses the TCAM coprocessor itself for KE to achieve wire-speed PC performance. If the encoded ranges come from S sub-fields, S separate range tables are needed for search KE. The S range tables as well as the rule table can be allocated in the same or different TCAMs. The KE involves matching S sub-fields against the corresponding S range tables to get an encoded search key. Then the encoded search key is matched against the rule table to get the final result. In summary, a PC with range encoding requires S range table lookups for KE and 1 RM lookup.

For typical 104-bit five-tuple rules, ranges only appear in the source and destination port sub-fields, and hence only 2 range tables are needed. For a TCAM with a 64-bit slot size, each rule takes 2 slots, which leaves 24 free bits for range encoding. Each RM takes 2 TCAM lookups (each slot takes 1 lookup). A range coming from the source/destination port sub-field takes 1 slot in a range table, hence incurring 1 TCAM lookup for each range table matching. In summary, there are a total of 4 TCAM lookups per PC. With a 100 MHz TCAM performing 100 million lookups per second, DRES can barely support OC192 (i.e., 25 Mpps) wire-speed performance. The distributed TCAM scheme that exploits CLP to increase the TCAM lookup performance is needed to support line rates higher than OC192. The following sections present the details on how to incorporate DRES into the proposed distributed solution.

D. Efficient Load-balancing Schemes

Note that the DPPC formulation is static, in the sense that once the Key-ID groups are populated in the different TCAMs, the performance is pretty much subject to traffic pattern changes. The inclusion of range encoding provides us a very efficient
way to dynamically balance the PC traffic in response to traffic pattern changes. The key idea is to duplicate the range encoding tables to all the TCAMs and hence allow a KE to be performed using any one of the TCAMs, to dynamically balance the load. Since the sizes of the range tables are small, e.g., no more than 15 entries for all the 5 real-world databases, duplicating the range tables to all the TCAMs does not impose a distinct overhead.

We design two algorithms for dynamic RE. First, we define some mathematical terms. Let D[k] be the overall traffic load ratio assigned to TCAM #k (k = 1, 2, ..., K), which includes two parts, i.e., the KE traffic load ratio and the RM traffic load ratio, with each contributing 50% of the load, according to Section III.C.

Let KE[k] and RM[k] be the KE and RM traffic ratios allocated to TCAM #k (k = 1, 2, ..., K), respectively. Note that RM[k] is determined by the Distributed Table Construction process (refer to Section III.B).

Let A[i,k], with A[i,k] ≥ 0, i, k = 1, ..., K, and Σ_{i=1}^{K} A[i,k] = 1, be the Adjustment Factor Matrix, which is defined as the percentage (ratio) of the KE tasks allocated to TCAM #i for the corresponding RM tasks which are performed in TCAM #k. Then the dynamic load balancing problem is formulated as follows:

To decide A[i,k], i, k = 1, ..., K, that

Minimize: Max_{k=1,...,K} D[k] − Min_{k=1,...,K} D[k]

Subject to:
D[k] = 0.5 × KE[k] + 0.5 × RM[k], k = 1, ..., K;
KE[i] = Σ_{k=1,...,K} A[i,k] × RM[k], i = 1, ..., K.

The following two algorithms are proposed to solve the above problem.

Stagger Round Robin (SRR): The idea is to allocate the KE tasks of the incoming packets, whose RM tasks are performed in a specific TCAM, to the other TCAM chips in a round-robin fashion. Mathematically, this means that:

A[k,k] = 0 and A[i,k] = 1/(K−1), i ≠ k, i, k = 1, ..., K.

We then have

D[k] = 0.5 × (RM[1] + ... + RM[k−1] + RM[k+1] + ... + RM[K])/(K−1) + 0.5 × RM[k], k = 1, ..., K.

Comments: In the case when K=2, the objective is a constant "0". This means that no matter how large the variance of the RM traffic load is, the variance of the overall traffic load is reduced to no more than half of that of the RM tasks.

Full Adaptation (FA): The idea of FA is to use a counter to keep track of the current number of backlogged tasks in the buffer at each TCAM chip. Whenever a packet arrives, the corresponding KE task is assigned to the TCAM that has the smallest counter value.

In this case, the values of A[k,i] are not fixed. The expression for D[k] is given by:

D[k] = 0.5 × RM[k] + 0.5 × (A[k,1] × RM[1] + ... + A[k,K] × RM[K]).

Noting that 0 ≤ A[k,i] ≤ 1, i = 1, ..., K, we have

0.5 × RM[k] ≤ D[k] ≤ 1, k = 1, ..., K.

Taking A[i,k] as tunable parameters, it is straightforward that the equations

1/K = D[k] = 0.5 × RM[k] + 0.5 × (A[k,1] × RM[1] + ... + A[k,K] × RM[K]), k = 1, ..., K,

must have feasible solutions when 0.5 × RM[k] ≤ 1/K, i.e., RM[k] ≤ 2/K, k = 1, ..., K. This means that if the conditions RM[k] ≤ 2/K, k = 1, ..., K, are all satisfied, the overall traffic load ratio can be perfectly balanced (the objective value is 0) in the presence of traffic pattern changes.

Comments: The overall traffic load can be perfectly balanced when RM[k] ≤ 2/K, k = 1, ..., K, are satisfied, which makes FA a very efficient solution compared with SRR. However, FA incurs more implementation cost due to the need for a counter for each TCAM chip. Further discussions on the performance of SRR and FA are presented in Section V.

IV. IMPLEMENTATION OF THE DPPC-RE SCHEME

The detailed implementation of the DPPC-RE mechanism is depicted in Fig. 5. Besides the TCAM chips and the associated SRAMs that accommodate the match conditions and the associated actions, three major additional components are included to co-operate with the TCAM chips, i.e., a Distributor, a set of Processing Units (PUs), and a Mapper. Some associated small buffer queues are used as well. Now we describe these components in detail.

A. The Distributor

This component is actually a scheduler. It partitions the PC traffic among the TCAM chips. More specifically, it performs three major tasks. First, it extracts the Key-ID from the 5-tuple received from a network processing unit (NPU). The Key-ID is used as an identifier to dispatch the RM keys to the associated TCAM. The 5-tuple is pushed into the RM FIFO queue of the corresponding TCAM (solid arrows in Fig. 5).

Second, the Distributor distributes the KE traffic among the TCAM chips, based on either the FA or the SRR algorithm.
RM load ratios among all the TCAM chips is, SRR can always correspondmg information, i.e., the SPORT and DPORT are
perfectly balance the overall traffic load. pushed into the KE FIFO of the TCAM selected (dashed arrows
Sinceosx (x'- 2 ) / ( ~- 1 ) < 0 . 5 , it means in any case, SRR can in Fig.5).
always reduce the variance of the overall load ratio to less than Third, the distributor maintains K Serial Numbers (9" or
)

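As a concrete sketch of the two KE-allocation policies the Distributor can apply (SRR hands each TCAM's KE tasks to the other chips in round-robin order; FA picks the chip with the smallest load counter), the following Python fragment may help. The class and method names are illustrative, not from the paper:

```python
import itertools

class KEAllocator:
    """Sketch of the SRR and FA policies for choosing which TCAM runs
    the key-encoding (KE) task of a packet whose range-matching (RM)
    task is bound to TCAM #rm (0-based here, 1-based in the text)."""

    def __init__(self, K):
        self.K = K
        # SRR state: one cyclic pointer per RM TCAM, skipping itself,
        # i.e., A[k,k] = 0 and A[i,k] = 1/(K-1) for i != k.
        self.srr_cycles = [itertools.cycle([i for i in range(K) if i != k])
                           for k in range(K)]
        # FA state: one load counter per TCAM chip.
        self.counters = [0] * K

    def srr(self, rm):
        # Hand the KE task to the other TCAMs in round-robin order.
        return next(self.srr_cycles[rm])

    def fa(self, rm):
        # Assign the KE task to the TCAM with the smallest counter,
        # then charge that TCAM for the work.
        ke = min(range(self.K), key=lambda i: self.counters[i])
        self.counters[ke] += 1
        return ke
```

With K = 4, `srr(0)` cycles through TCAMs 1, 2, 3 and never returns 0, while `fa` spreads tasks across all chips as their counters grow.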
An S/N is used to identify each incoming packet (or, more precisely, each incoming five-tuple). Whenever a packet arrives, the distributor adds "1" (cyclically, with modulus equal to the RM FIFO depth) to the S/N counter of the TCAM the packet is mapped to. A Tag is defined as the combination of an S/N and a TCAM number (CAMID). This tag is used to uniquely identify a packet and its associated RM TCAM. The format of the Tag is depicted in Fig. 6(a).

Fig. 5 DPPC-RE mechanism

As we shall explain shortly, the tag is used by the Mapper to return the KE results back to the correct TCAM and to allow the PU for that TCAM to establish the association of these results with the corresponding five-tuple in the RM queue.

  [ S/N(5) | CAMID(3) ]
  (a) Tag Format

  [ PROT(8) | DIP(32) | SIP(32) | DPORT(16) | SPORT(16) | Tag(8) ]
  (b) RM Buffer Format

  [ DPORT(16) | SPORT(16) | Tag(8) ]
  (c) KE Buffer Format

  [ Valid(1) | DPK(8) | SPK(8) ]
  (d) Key Buffer Format

Fig. 6 Format of Tag, RM FIFO, KE FIFO and Key Buffer.

B. RM FIFO, KE FIFO, Key Buffer and Tag FIFO

An RM FIFO is a small FIFO queue where the information for RM of the incoming packets is held. The format of each unit in the RM FIFO is given in Fig. 6(b). (The numbers in the brackets indicate the number of memory bits needed for the subfields.)

A KE FIFO is a small FIFO queue where the information used for KE is held. The format of each unit in the KE FIFO is given in Fig. 6(c).

Differing from the RM and KE FIFOs, a Key Buffer is not a FIFO queue, but a fast register file accessed using an S/N as the address. It is where the results of KE (the encoded bit vectors of the range subfields) are held. The size of a Key Buffer equals the size of the corresponding RM FIFO, with one unit in the Key Buffer corresponding to one unit in the RM FIFO. The format of each unit is given in Fig. 6(d). The Valid bit is used to indicate whether the content is available and up-to-date.

Note that the tags of the keys cannot be passed through the TCAM chips during the matching operations. Hence a Tag FIFO is designed for each TCAM chip to keep the tag information while the associated keys are being matched.

C. The Processing Unit

Each TCAM is associated with a Processing Unit (PU). The functions of a PU are to (a) schedule the RM and KE tasks assigned to the corresponding TCAM, aiming at maximizing the utilization of that TCAM; and (b) ensure that the results of the incoming packets assigned to this TCAM are returned in order. In what follows, we elaborate on these two functions.

(a) Scheduling between RM and KE tasks: Note that, for any given packet, the RM operation cannot take place until the KE results are returned. Hence, it is apparent that the units in an RM FIFO would wait for a longer time than the units in a KE FIFO. For this reason, RM tasks should be assigned higher priority than KE tasks. However, our analysis (not given here due to the page limitation) indicates that a strict-sense priority scheduler may lead to non-deterministically large processing delay. So we introduce a Weighted Round-Robin scheme in the PU design. More specifically, each type of task gains higher priority in turn, based on an asymmetrical Round-Robin mechanism. In other words, the KE tasks gain higher priority for one turn (one turn represents 2 TCAM accesses, for either an RM operation or two successive KE operations) after n turns with the higher priority assigned to RM tasks. Here n is defined as the Round-Robin Ratio (RRR).

(b) Ordered Processing: Apparently, the order of the returned PC results from a specific TCAM is determined by the processing order of the RM operations. Since an RM FIFO is a FIFO queue, the PC results can still be returned in the same order as the packet arrivals, although the KE tasks of the packets may not be processed in their original sequence [iv]. As a result, if the KE result for a given RM unit returns earlier than those of the units in front of it, this RM unit cannot be executed. Specifically, the PU for a given TCAM maintains a pointer to the position in the Key Buffer that contains the KE result corresponding to the unit at the head of the RM FIFO. The value of the pointer equals the S/N of the unit at the head of the RM FIFO. In each TCAM cycle, the PU queries the valid bit of the position that the pointer points to in the Key Buffer.

[iv] This is because the KE tasks whose RM is processed in a specific TCAM may be assigned to different TCAMs to be processed, based on the FA or SRR algorithm.
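The two PU functions described above, the asymmetrical RM/KE round-robin and the valid-bit check that preserves arrival order, can be sketched together as follows. This is a simplified model: class and field names are ours, one `step()` stands in for a scheduling turn, and when an RM head unit is not yet ready the sketch lets a pending KE task run instead, to keep the TCAM busy:

```python
from collections import deque

class ProcessingUnit:
    """Sketch of one PU. RM tasks get priority for `rrr` consecutive
    turns, then KE tasks get one turn; an RM task runs only when the
    Key Buffer entry for the head of the RM FIFO is valid."""

    def __init__(self, depth=8, rrr=3):
        self.rm_fifo = deque()            # holds (sn, five_tuple)
        self.ke_fifo = deque()            # holds (sn, port_fields)
        self.key_buffer = [None] * depth  # indexed by S/N; None = invalid
        self.pointer = 0                  # S/N of the RM FIFO head unit
        self.depth = depth
        self.rrr = rrr
        self.turn = 0

    def step(self):
        """One scheduling turn; returns the task executed, if any."""
        rm_turn = (self.turn % (self.rrr + 1)) < self.rrr
        self.turn += 1
        rm_ready = self.rm_fifo and self.key_buffer[self.pointer] is not None
        if (rm_turn or not self.ke_fifo) and rm_ready:
            sn, five_tuple = self.rm_fifo.popleft()
            key = self.key_buffer[self.pointer]
            self.key_buffer[self.pointer] = None        # reset valid bit
            self.pointer = (self.pointer + 1) % self.depth
            return ("RM", five_tuple, key)
        if self.ke_fifo:
            sn, ports = self.ke_fifo.popleft()
            return ("KE", sn, ports)
        return None
```

A head-of-line RM unit whose KE result has not arrived simply waits: `step()` skips it until the matching Key Buffer slot becomes valid, which is what keeps results in packet-arrival order.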
If the bit is set, meaning that the KE result is ready, and it is RM's turn for execution, the PU reads the KE result out from the Key Buffer and the 5-tuple information out from the RM FIFO queue, and launches the RM operation. Meanwhile, the valid bit of the current unit in the Key Buffer is reset and the pointer is incremented by 1 in a cyclical fashion. Since the S/N for a packet in a specific TCAM is assigned cyclically by the Distributor, the pointer is guaranteed to always point to the unit in the Key Buffer that corresponds to the head unit in the RM FIFO.

D. The Mapper

The function of this component is to manage the result returning process of the TCAM chips. According to the processing flow of a PC operation, the Mapper has to handle three types of results, i.e., the KE Phase-I results (for the SPORT sub-field), the KE Phase-II results (for the DPORT sub-field), and the RM results. The type of the result is encoded in the result itself.

If the result from any TCAM is an RM result (which is decoded from the result itself), the Mapper returns it to the NPU directly. If it is a KE Phase-I result, the Mapper stores it in a latch and waits for the Phase-II result, which will come in the next cycle. If it is a KE Phase-II result, the Mapper uses the tag information from the Tag FIFO to determine: 1) which Key Buffer (according to the CAMID segment) this result should be returned to, and 2) which unit in the Key Buffer (according to the S/N segment) this result should be written into. Finally, the Mapper combines the two results (of Phase I and Phase II) into one and returns it.

E. An Example of the PC Processing Flow

Suppose that the ID-bit selection is based on Rule database #5, and the four ID-bits are PROT(4), DIP(1), DIP(21), and SIP(4). The distributed rule table is given in Table II (in Section III.B). Given a packet P0 with 5-tuple <166.111.140.1, 202.205.4.3, 15335, 80, 6>, the processing flow is the following (also shown in Fig. 5):

(1) The 4-bit Key-ID "0010" is extracted by the Distributor.

(2) According to the distributed rule table given by Table II, Key-ID group "0010" is stored in TCAM#1. Suppose that the current S/N value of TCAM#1 is "5"; then the CAMID "001" and the new S/N "00110" (5+1) are combined into the Tag, with value "00100110". Then the 5-tuple together with the Tag is pushed into the RM FIFO of TCAM#1.

(3) Suppose that the current queue sizes of the 5 KE FIFOs are 2, 0, 1, 1, and 1, respectively. According to the FA algorithm, the KE operation of packet P0 is to be performed in TCAM#2. Then the two range subfields <15335, 80>, together with the Tag, are pushed into the KE FIFO associated with TCAM#2.

(4) Suppose that now it is KE's turn, or no RM task is ready for execution. PU#2 pops the head unit (<15335, 80> + Tag<00100110>) from the KE FIFO, and sends it to TCAM#2 to perform the two range encodings successively. Meanwhile, the corresponding tag is pushed into the Tag FIFO.

(5) When both results are received by the Mapper, it combines them into one, and pops the head unit from the Tag FIFO.

(6) The CAMID field "001" in the Tag indicates that the result should be sent back to the Key Buffer of TCAM#1, while the S/N field "00110" indicates that it should be stored in the 6th unit of the Key Buffer. Meanwhile, the corresponding valid bit is set.

(7) Suppose that all the packets before packet P0 have been processed, and P0 is now the head unit in the RM FIFO of TCAM#1. Note that packet P0 has S/N "00110". Hence, when it is the RM's turn, PU#1 probes the valid bit of the 6th unit in the Key Buffer.

(8) When PU#1 finds that the bit is set, it pops the head unit from the RM FIFO (the 5-tuple) and reads the contents out from the 6th unit of the Key Buffer (the encoded keys of the two ranges), and then launches an RM operation in TCAM#1.

(9) When the Mapper receives the RM result, it returns it back to the NPU, completing the whole PC process cycle for packet P0.

V. EXPERIMENTAL RESULTS

A. Simulation Results

Simulation Setup: Traffic Pattern: Poisson arrival process; Buffer Size: RM FIFO = 8, Key Buffer = 8, KE FIFO = 4; Round-Robin Ratio = 3; Traffic Load Distribution among Key-ID groups: given in Fig. 2.

Fig. 7 Simulation results (Throughput)

Throughput Performance: The simulation results are given in Fig. 7. One can see that at K=5, the OC-768 throughput is guaranteed even when the system is heavily loaded (traffic intensity tends to 100%), whether the FA or the SRR algorithm is adopted. This is mainly because the theoretical throughput upper bound at K=5 (5 x 100M/4 = 125 Mpps) is 1.25 times the OC-768 maximum packet rate (100 Mpps). In contrast, at K=4, the throughput falls short of the wire speed when SRR is used, while FA performs fairly well, indicating that FA has better load-balancing capability than SRR.

Delay Performance: According to the processing flow of the DPPC-RE scheme, the minimum delay for each PC is 10 TCAM cycles (5 for RM and 5 for KE). In general, however, additional cycles are needed for a PC because of the queuing effect. We focus on the performance when the system is heavily loaded. Fig. 8 shows the delay distribution for the back-to-back mode, i.e., when packets arrive back-to-back (traffic intensity = 100%).

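The 8-bit tag and the Mapper's two-phase result handling can be sketched as follows. Note that the worked example concatenates the 3-bit CAMID with the 5-bit S/N ("001" + "00110" gives "00100110"), so this sketch places the CAMID in the high bits; all names below are ours, not the paper's:

```python
SN_BITS, CAMID_BITS = 5, 3

def make_tag(camid: int, sn: int) -> int:
    # Following the worked example: CAMID in the high 3 bits,
    # S/N in the low 5 bits.
    assert 0 <= camid < 2**CAMID_BITS and 0 <= sn < 2**SN_BITS
    return (camid << SN_BITS) | sn

def split_tag(tag: int):
    # Recover (camid, sn) from an 8-bit tag.
    return tag >> SN_BITS, tag & (2**SN_BITS - 1)

class Mapper:
    """Sketch of the Mapper's handling of the two-phase KE results.
    Phase I (SPORT key) is latched; Phase II (DPORT key) triggers the
    write-back into the Key Buffer unit selected by the tag."""

    def __init__(self, key_buffers):
        self.key_buffers = key_buffers    # camid -> list indexed by S/N
        self.latch = None

    def on_ke_phase1(self, spk):
        self.latch = spk                  # wait one cycle for Phase II

    def on_ke_phase2(self, dpk, tag):
        camid, sn = split_tag(tag)
        # Writing a non-None pair models setting the Valid bit.
        self.key_buffers[camid][sn] = (self.latch, dpk)
        self.latch = None
```

For packet P0 in the example, `make_tag(0b001, 0b00110)` yields the tag "00100110", and the Phase-II write-back lands in unit 6 of TCAM#1's Key Buffer, exactly where PU#1 later probes the valid bit.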
We note that the average delay is reasonably small except for the case at K=4 when SRR is adopted (avg. delay > 20 TCAM cycles). In this case, when the offered load reaches the theoretical limit (i.e., 100 Mpps), a large number of packets are dropped due to SRR's inability to effectively balance the load. The delay distributions for the cases using FA (K=4 or 5) are much more concentrated than those using SRR, suggesting that FA offers much smaller and more deterministic delay performance than SRR. Note that more deterministic delay performance results in smaller buffer/cache requirements and lower implementation complexity for the TCAM classifier as well as the other components in the fast data path.

Fig. 8. Delay distributions (in TCAM cycles) of the four simulations.

Change of Traffic Pattern: In order to measure the stability and adaptability of the DPPC-RE scheme when the traffic pattern changes over time, we run the following simulations in the back-to-back mode (traffic intensity = 100%). The traffic pattern depicted in Fig. 2 is denoted as Pattern I (uneven distribution), and the uniform distribution is denoted as Pattern II. We first construct the distributed table according to one of the patterns and measure the throughput performance under this traffic pattern. Then we change the traffic to the other pattern and measure the throughput performance again without reconstructing the distributed table. The associated simulation setups are given in Table IV.

TABLE IV Simulation setups.

The results are given in Table V. We find that although the traffic pattern changes significantly, the throughput performance decreases only slightly [v] (<1%) in all the cases when FA is adopted. This means that FA excels in adapting to traffic pattern changes. The performance of SRR is a bit worse (>4% in Case V). Overall, we may conclude that the DPPC-RE scheme copes with changes of the traffic pattern well.

TABLE V Throughput ratios in the presence of traffic pattern changes.

  Before (%)   After (%)
  99.76        99.96
  99.24        --
  98.99        98.07
  92.76        88.39
  98.38        91.71

[v] In Case I, the throughput even increases, which indicates that the change of the traffic pattern can even have a positive effect on the performance. This may be caused by the use of the greedy (i.e., not optimum) algorithm for table construction.

B. Comparison with other schemes

Since each PC operation needs at least 2 TCAM accesses, as mentioned in Section III, a single 100MHz TCAM chip cannot provide OC-768 wire-speed (100 Mpps) PC. So CLP (chip-level parallelism) must be adopted to achieve this goal. Depending on the method of achieving CLP (using distributed storage or duplicating the table), and on whether Key Encoding is adopted, there are four different possible schemes:

1) Duplicate Table + No Key Encoding: Two 100MHz TCAM chips should be used in parallel to achieve 100 Mpps, with each containing a full, un-encoded rule table (with N entries). A total number of K x N TCAM entries is required. It is the simplest to implement and offers deterministic performance (zero loss rate and fixed processing delay);

2) Duplicate Table + Key Encoding: Four 100MHz TCAM chips should be used in parallel to achieve 100 Mpps, with each containing an encoded rule table (with Ne entries). A total number of K x Ne TCAM entries is required. It also offers lossless, deterministic performance;

3) Distributed Storage + Key Encoding (DPPC-RE): Four or five 100MHz TCAM chips should be used in parallel. The total number of TCAM entries required is Ne x DER (which is not linearly proportional to K). It may incur slight loss when heavily loaded;

4) Distributed Storage + No Key Encoding: Two 100MHz TCAM chips should be used in parallel. The total number of TCAM entries required is N x DER. Without a dynamic load-balancing mechanism (which can only be employed when adopting Key Encoding), its performance is non-deterministic and massive loss may occur when the system is heavily loaded or the traffic pattern changes.

The TCAM Expansion Ratios (ER, defined as the ratio of the total number of TCAM entries required to the total number of rules in the rule database) are calculated for all five real-world databases based on these four schemes. The results are given in Table VI.

Apparently Distributed + KE, i.e., DPPC-RE, significantly outperforms the other three schemes in terms of TCAM storage efficiency. Moreover, with only a slight increase of ER for K=5 compared with K=4, the OC-768 wire-speed PC throughput performance can be guaranteed for the Distributed + KE (K=5) case.
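The entry-count arithmetic behind this four-way comparison can be sketched as follows. The N and Ne values below come from the database #1 column of Table VI; the DER value is an assumed placeholder, since the per-database DER figures are not reproduced in this excerpt:

```python
def tcam_entries(scheme, N, Ne, K=1, der=1.0):
    """Total TCAM entries needed under each of the four CLP schemes.
    N: un-encoded (range-expanded) rule count; Ne: rule count after
    key encoding; K: number of parallel chips; der: the Distributed
    Expansion Ratio produced by the rule-partition algorithm."""
    if scheme == "dup":                 # Duplicate Table + No KE
        return K * N
    if scheme == "dup+ke":              # Duplicate Table + KE
        return K * Ne
    if scheme == "dist+ke":             # DPPC-RE
        return int(Ne * der)            # grows with DER, not with K
    if scheme == "dist":                # Distributed Storage + No KE
        return int(N * der)
    raise ValueError(scheme)

# Database #1 of Table VI: N = 949, Ne = 279.
print(tcam_entries("dup+ke", 949, 279, K=4))   # 1116
```

Dividing any of these totals by the original rule count of the database gives the Expansion Ratio reported in Table VI; the point of the comparison is that only the `dist+ke` total is decoupled from K.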
TABLE VI Comparison of the Expansion Ratio

  Rule database        #1    #2    #3    #4    #5
  Expanded Rules (N)   949   553   415   1638  2180
  After KE (Ne)        279   183   158   264   1550

Obviously, DPPC-RE exploits the tradeoff between deterministic performance and high statistical throughput performance, while the schemes with table duplication gain high, deterministic performance at a significant memory cost. Because the combination of Distributed Storage and Range Encoding (i.e., the DPPC-RE scheme) provides a very good balance between the worst-case performance guarantee and low memory cost, it is an attractive solution.

VI. DISCUSSION AND CONCLUSION

Insufficient memory bandwidth of a single TCAM chip and the large expansion ratio caused by the range matching problem are the two important issues that have to be solved when adopting TCAMs to build high performance and low cost packet classifiers for next generation multi-gigabit router interfaces.

In this paper, a distributed parallel packet classification scheme with range encoding (DPPC-RE) is proposed to achieve OC-768 (40 Gbps) wire-speed packet classification with minimum TCAM cost. DPPC-RE includes a rule partition algorithm to distribute rules into different TCAM chips with minimum redundancy, and a heuristic algorithm to balance the traffic load and storage demand among all TCAMs. The implementation details and a comprehensive performance evaluation are also presented.

A key issue that has not been addressed in this paper is how to update the rule and range tables with minimum impact on the packet classification process. The consistent policy table update algorithm (CoPTUA) [19] proposed by two of the authors allows the TCAM policy rule and range tables to be updated without impacting the packet classification process. CoPTUA can be easily incorporated into the proposed scheme to eliminate the performance impact of rule and range table updates. However, this issue is not discussed in this paper due to the space limitation.

VII. ACKNOWLEDGEMENT

The authors would like to express their gratitude to Professor Jonathan Turner and Mr. David Taylor from Washington University in St. Louis for kindly sharing their real-world databases and the related statistics with them.

VIII. REFERENCES

[1] V. Srinivasan, G. Varghese, S. Suri, and M. Waldvogel, "Fast and Scalable Layer Four Switching", Proc. of ACM SIGCOMM, 1998.
[2] T. V. Lakshman and D. Stiliadis, "High-Speed Policy-based Packet Forwarding Using Efficient Multi-dimensional Range Matching", Proc. of ACM SIGCOMM, 1998.
[3] M. Buddhikot, S. Suri, and M. Waldvogel, "Space Decomposition Techniques for Fast Layer-4 Switching", Protocols for High Speed Networks (Proceedings of PfHSN '99), 1999.
[4] A. Feldmann and S. Muthukrishnan, "Tradeoffs for Packet Classification", Proc. of IEEE INFOCOM, 2000.
[5] F. Baboescu, S. Singh, and G. Varghese, "Packet Classification for Core Routers: Is there an alternative to CAMs?", Proc. of IEEE INFOCOM, San Francisco, USA, 2003.
[6] P. Gupta and N. McKeown, "Packet Classification on Multiple Fields", Proc. of ACM SIGCOMM, 1999.
[7] P. Gupta and N. McKeown, "Packet Classification using Hierarchical Intelligent Cuttings", IEEE Micro Magazine, Vol. 20, No. 1, pp. 34-41, January-February 2000.
[8] V. Srinivasan, S. Suri, and G. Varghese, "Packet Classification using Tuple Space Search", Proc. of ACM SIGCOMM, 1999.
[9] S. Singh, F. Baboescu, G. Varghese, and J. Wang, "Packet Classification Using Multidimensional Cutting", Proc. of ACM SIGCOMM, 2003.
[10] E. Spitznagel, D. Taylor, and J. Turner, "Packet Classification Using Extended TCAMs", Proc. of the International Conference on Network Protocols (ICNP), September 2003.
[11] T. Lakshman and D. Stiliadis, "High-speed policy-based packet forwarding using efficient multi-dimensional range matching," ACM SIGCOMM Computer Communication Review, Vol. 28, No. 4, pp. 203-214, October 1998.
[12] H. Liu, "Efficient Mapping of Range Classifier into Ternary CAM", Proc. of the 10th Symposium on High Performance Interconnects (HotI '02), August 2002.
[13] J. van Lunteren and A. P. J. Engbersen, "Dynamic multi-field packet classification", Proc. of the IEEE Global Telecommunications Conference (Globecom '02), pp. 2215-2219, November 2002.
[14] J. van Lunteren and A. P. J. Engbersen, "Fast and Scalable Packet Classification", IEEE Journal on Selected Areas in Communications, Vol. 21, No. 4, pp. 560-571, May 2003.
[15] H. Che, Z. Wang, K. Zheng, and B. Liu, "DRES: Dynamic Range Encoding Scheme for TCAM Coprocessors," submitted to IEEE Transactions on Computers; also available online at uta.edu/~zwang/dres.pdf.
[16] K. Zheng, C. C. Hu, H. B. Lu, and B. Liu, "An Ultra High Throughput and Power Efficient TCAM-Based IP Lookup Engine", Proc. of IEEE INFOCOM, April 2004.
[17] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. The China Machine Press, ISBN 7-111-10921-X, pp. 12 and pp. 390-391.
[18] Please see Acknowledgement.
[19] Z. Wang, H. Che, M. Kumar, and S. K. Das, "CoPTUA: Consistent Policy Table Update Algorithm for TCAM without Locking," IEEE Transactions on Computers, 53(12):1602-1614, 2004.
