Optimizing The Usage of Normalization
Vladimir Weinstein
vweinste@us.ibm.com
Introduction
1. The Unicode standard has multiple ways to encode equivalent strings
2. Accents that don't interact are put into a unique order
21st International Unicode Conference Dublin, Ireland, May 2002
[Figure: the word "résumé" encoded in NFD (base letters followed by combining accents) and in NFC (precomposed letters)]
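The two points above can be checked with a short sketch. It uses Python's standard `unicodedata` module as a stand-in for ICU (an assumption; the talk itself is about ICU), and the sample strings are illustrative only.

```python
import unicodedata

# "résumé" written with precomposed letters and with combining acute accents.
composed = "r\u00e9sum\u00e9"
decomposed = "re\u0301sume\u0301"

# Both encodings normalize to the same NFC string and to the same NFD string.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

# Accents that do not interact (different combining classes) get one fixed
# order: dot below (class 220) is placed before acute (class 230).
reordered = unicodedata.normalize("NFD", "a\u0301\u0323")
```

The combining class of a mark can be read with `unicodedata.combining`, which is what decides the order in the last line.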
Introduction (contd.)
Normalization provides a way to transform a string to a unique form (NFD, NFC)
Strings that can be transformed to the same form are called canonically equivalent
Time-critical applications need to minimize the number of passes over the text
ICU gives a number of tools to deal with this problem
We will use collation (language-sensitive string comparison) as an example
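As a baseline, canonical equivalence can be tested by normalizing both strings and comparing the results. The sketch below does this with Python's `unicodedata` module (a stand-in for ICU; the function name `canonically_equivalent` is hypothetical). It makes a full normalization pass over each string, which is exactly the cost that time-critical code wants to avoid.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Return True if a and b normalize to the same NFD string."""
    # Two full normalization passes: simple, but not optimized.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)
```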
Avoiding Normalization
Force users to provide already normalized data
The performance problem does not go away
When the strings are processed many times, it could be beneficial to normalize them beforehand
Forcing users to provide a specific form can be unpopular
Normalize Incrementally
Instead of normalizing the whole string at once, normalize one piece at a time
This technique is usually combined with an incremental Quick Check
Useful for procedures with early exit, such as string comparison or scanning
Normalizes up to the next safe point
With regular normalization, the whole string is processed by normalization
With incremental normalization, only the parts that fail the quick check are normalized
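A minimal sketch of the incremental idea, restricted to NFD and written with Python's `unicodedata` module rather than ICU (an assumption). The names `is_nfd_boundary` and `nfd_segments` are hypothetical. A safe point is taken to be any position before a character whose decomposition begins with a starter (combining class 0). `unicodedata.is_normalized` plays the role of the quick check and needs Python 3.8 or later.

```python
import unicodedata
from typing import Iterator

def is_nfd_boundary(ch: str) -> bool:
    """True if text can be split before ch without changing the NFD result."""
    # The first code point of ch's decomposition must be a starter (class 0).
    return unicodedata.combining(unicodedata.normalize("NFD", ch)[0]) == 0

def nfd_segments(text: str) -> Iterator[str]:
    """Lazily yield NFD pieces of text, one safe-point segment at a time."""
    start = 0
    for i in range(1, len(text) + 1):
        if i == len(text) or is_nfd_boundary(text[i]):
            segment = text[start:i]
            # Quick check first; normalize only the segments that fail it.
            if not unicodedata.is_normalized("NFD", segment):
                segment = unicodedata.normalize("NFD", segment)
            yield segment
            start = i
```

Because the function is a generator, a caller that exits early (for example a comparison that finds a difference in the first few characters) never pays for normalizing the rest of the string.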
Optimized Concatenation
Simple concatenation of two normalized strings can yield a string that is not normalized
One option is to normalize the result, but this unnecessarily duplicates normalization work
[Figure: concatenation of two normalized strings whose result is not normalized]
It is enough to normalize the boundary parts
Incremental normalization is used
Much faster than redoing the whole resulting string
[Figure: only the boundary between the two strings is renormalized]
Canonical Closure
Preprocessing data to support the FCD form
Ensures that if data is assigned to a sequence (or a code point), it is also assigned to all canonically equivalent FCD sequences
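[Figure: canonical closure example. A-ring (U+00C5) is mapped to X; closure adds the same mapping for the Angstrom sign (U+212B) and for A followed by a combining ring]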
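A deliberately simplified sketch of the closure idea, using Python's `unicodedata` module rather than ICU. The names `build_reverse_index` and `close_mapping` are hypothetical. It only adds the NFD form, the NFC form, and any single BMP code point with the same decomposition; a real canonical closure would also have to enumerate the other equivalent FCD sequences, which this sketch does not attempt.

```python
import unicodedata
from collections import defaultdict

def build_reverse_index() -> dict:
    """Map each NFD string to the BMP code points that decompose to it."""
    index = defaultdict(set)
    for cp in range(0x10000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        ch = chr(cp)
        nfd = unicodedata.normalize("NFD", ch)
        if nfd != ch:
            index[nfd].add(ch)
    return index

def close_mapping(mapping: dict) -> dict:
    """Copy each entry's data to some canonically equivalent keys."""
    index = build_reverse_index()
    closed = dict(mapping)
    for key, value in mapping.items():
        nfd = unicodedata.normalize("NFD", key)
        variants = {nfd, unicodedata.normalize("NFC", key)} | index.get(nfd, set())
        for variant in variants:
            # Do not overwrite data that was assigned explicitly.
            closed.setdefault(variant, value)
    return closed

closed = close_mapping({"\u00c5": "X"})  # A-ring mapped to X
```

After the call, the Angstrom sign and the sequence A + combining ring carry the same data as A-ring, so a lookup no longer needs to normalize its input first.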
Collation
Locale-specific sorting of strings
Relation between code points and collation elements
Context sensitive:
  Contractions: H < Z, but CZ < CH
  Expansions: OE < Œ < OF
  Both: < or >
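A toy illustration of how contractions and expansions turn code points into collation elements. The weights, tables, and the name `primary_key` are invented for this sketch and are not ICU data; only a single (primary) level is modeled, and only lowercase a to z plus "œ" are supported.

```python
from typing import List, Tuple

# Toy primary weights: a..z spaced by 10 so tailored elements fit in between.
WEIGHTS = {chr(ord("a") + i): 10 * (i + 1) for i in range(26)}
# Contraction: "ch" is ONE collation element, sorted right after "h".
CONTRACTIONS = {"ch": WEIGHTS["h"] + 5}
# Expansion: "œ" yields TWO collation elements, those of "o" and "e".
EXPANSIONS = {"\u0153": [WEIGHTS["o"], WEIGHTS["e"]]}

def primary_key(text: str) -> Tuple[int, ...]:
    """Build a primary-level sort key from the toy tables above."""
    elements: List[int] = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in CONTRACTIONS:          # longest match first
            elements.append(CONTRACTIONS[pair])
            i += 2
        elif text[i] in EXPANSIONS:
            elements.extend(EXPANSIONS[text[i]])
            i += 1
        else:
            elements.append(WEIGHTS[text[i]])
            i += 1
    return tuple(elements)
```

With these tables, `primary_key("h") < primary_key("z")` while `primary_key("cz") < primary_key("ch")`, matching the contraction example on the slide. Telling "œ" apart from "oe" would need a further comparison level, which is left out here.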
Some locales require normalization to be turned on
Canonical closure is done for contractions and regular mappings
Two important services:
  Sort key generation
  String compare function
String Compare
Very time critical
Result is usually determined before fully processing both strings
First step is binary comparison for equality
When it fails, comparison continues from a safe spot
[Figure: the normal situation, where there is no need to back up]
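The compare strategy can be sketched as follows, again with Python's `unicodedata` module in place of ICU. The names `is_nfd_boundary`, `_safe`, and `compare` are hypothetical, and code point order of the NFD text stands in for real collation elements. The identical prefix is skipped with a plain binary scan, the position is backed up to a safe spot only when a combining mark sits at the point of difference, and only the remainder is normalized.

```python
import unicodedata

def is_nfd_boundary(ch: str) -> bool:
    """True if the decomposition of ch begins with a starter (class 0)."""
    return unicodedata.combining(unicodedata.normalize("NFD", ch)[0]) == 0

def _safe(s: str, i: int) -> bool:
    """Position i is safe at end of string or before a boundary character."""
    return i >= len(s) or is_nfd_boundary(s[i])

def compare(a: str, b: str) -> int:
    """Return -1, 0, or 1; canonically equivalent strings compare equal."""
    if a == b:                       # first step: binary equality
        return 0
    i = 0
    limit = min(len(a), len(b))
    while i < limit and a[i] == b[i]:
        i += 1                       # skip the identical prefix
    while i > 0 and not (_safe(a, i) and _safe(b, i)):
        i -= 1                       # back up to a safe spot if needed
    rest_a = unicodedata.normalize("NFD", a[i:])
    rest_b = unicodedata.normalize("NFD", b[i:])
    return (rest_a > rest_b) - (rest_a < rest_b)
```

In the normal situation the first difference already sits on a safe spot, so the backup loop does nothing. A fuller version would also consume the remainder lazily so that the comparison can stop at the first differing element.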
Conclusion
The presented techniques allow much faster string processing
In case of collation, sort key generation gets up to 50% faster than if normalizing beforehand
String compare function becomes up to 3 times faster!
May increase data size
Canonical closure preprocessing takes more time to build, but pays off at runtime
Q&A
Summary
Introduction
Avoiding normalization
Check for normalized text
Normalize incrementally
Concatenation of normalized strings
Accepting the FCD form
Implementation of collation in ICU