Optimizing the Usage of Normalization
Vladimir Weinstein
vweinste@us.ibm.com
Globalization Center of Competency, San Jose, CA

21st International Unicode Conference, Dublin, Ireland, May 2002
Introduction
1. The Unicode standard has multiple ways to encode equivalent strings
2. Accents that don't interact are put into a unique order

[Figure: the word "résumé" encoded in NFD (decomposed) and in NFC (composed)]
Introduction (contd.)
- Normalization provides a way to transform a string to a unique form (NFD, NFC)
- Strings that can be transformed to the same form are called canonically equivalent
- Time-critical applications need to minimize the number of passes over the text
- ICU gives a number of tools to deal with this problem
- We will use collation (language-sensitive string comparison) as an example
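
To make canonical equivalence concrete, here is a minimal sketch using Python's standard `unicodedata` module rather than ICU. The helper name `canonically_equivalent` is an assumption made for illustration; it is not an ICU function.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# The same word, once with precomposed letters and once with combining marks.
composed = "r\u00e9sum\u00e9"        # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "re\u0301sume\u0301"    # e followed by U+0301 COMBINING ACUTE ACCENT

# Marks with different combining classes do not interact, so normalization
# puts them into one fixed order regardless of how they were typed.
dot_then_acute = "a\u0323\u0301"     # dot below (class 220), then acute (class 230)
acute_then_dot = "a\u0301\u0323"

if __name__ == "__main__":
    print(composed == decomposed)                          # binary comparison
    print(canonically_equivalent(composed, decomposed))    # comparison after NFD
    print(canonically_equivalent(dot_then_acute, acute_then_dot))
```

The binary comparison fails even though both spellings are the same text to a user, which is why comparison-heavy code such as collation has to deal with normalization at all.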
Avoiding Normalization
- Force users to provide already normalized data
- The performance problem does not go away
- When the strings are processed many times, it could be beneficial to normalize them beforehand
- Forcing users to provide a specific form can be unpopular
Check for Normalized Text

- Most strings are already in normalized form
- Quick Check is significantly faster than full normalization
- Needs canonical class data and additional data for checking the relation between a code point and a normalization form
- Algorithm in UAX #15, Annex 8 (http://www.unicode.org/unicode/reports/tr15/#Annex8)
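
The quick check can be sketched for NFD with Python's standard `unicodedata` module. This is an illustration under stated assumptions, not ICU's implementation: `quick_check_nfd` is a hypothetical name, and instead of a precomputed quick-check property it derives the "not allowed in NFD" answer from whether a code point has a canonical decomposition. For NFD the answer is always yes or no; the NFC check described in UAX #15 also has a "maybe" result that is not covered here.

```python
import unicodedata

HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3  # precomposed Hangul syllables

def quick_check_nfd(text: str) -> bool:
    """Return True if text is already in NFD, using one pass and no buffers."""
    last_ccc = 0
    for ch in text:
        ccc = unicodedata.combining(ch)  # canonical combining class
        # Combining marks out of canonical order: not normalized.
        if ccc != 0 and ccc < last_ccc:
            return False
        # A canonical decomposition (no "<tag>" prefix) means the code point
        # cannot appear in NFD text.
        decomp = unicodedata.decomposition(ch)
        if decomp and not decomp.startswith("<"):
            return False
        # Hangul syllables decompose algorithmically and have no table entry.
        if HANGUL_FIRST <= ord(ch) <= HANGUL_LAST:
            return False
        last_ccc = ccc
    return True

if __name__ == "__main__":
    for sample in ["resume", "re\u0301sume\u0301", "r\u00e9sum\u00e9"]:
        print(ascii(sample), quick_check_nfd(sample))
```

Only strings that fail the check need to be handed to the full normalizer.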
Normalize Incrementally
- Instead of normalizing the whole string at once, normalize one piece at a time
- This technique is usually combined with an incremental Quick Check
- Useful for procedures with early exit, such as string comparison or scanning
- Normalizes up to the next safe point
Incremental Normalization: Example

[Figure: the initial string "résumé" processed by non-incremental normalization and by quick check plus incremental normalization]
- If normalized regularly, the whole string is processed by normalization
- With incremental normalization, only the parts that fail quick check are normalized
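
A rough sketch of incremental normalization, again with Python's `unicodedata` (Python 3.8+ for `is_normalized`) and for NFD only. The names `starts_with_starter`, `segments` and `incremental_nfd` are hypothetical. The safe-point rule used here is an assumption chosen to be conservative: a segment may start only at a code point that has combining class 0 and whose canonical decomposition also begins with such a code point.

```python
import unicodedata

def starts_with_starter(ch: str) -> bool:
    """True if ch has class 0 and its canonical decomposition starts with class 0."""
    if unicodedata.combining(ch) != 0:
        return False
    decomp = unicodedata.decomposition(ch)
    if decomp and not decomp.startswith("<"):
        first = chr(int(decomp.split()[0], 16))
        return starts_with_starter(first)
    return True

def segments(text: str):
    """Yield pieces of text, each beginning at a normalization-safe point."""
    start = 0
    for i in range(1, len(text)):
        if starts_with_starter(text[i]):
            yield text[start:i]
            start = i
    if text:
        yield text[start:]

def incremental_nfd(text: str):
    """Lazily yield NFD pieces; only pieces that fail the check are normalized."""
    for piece in segments(text):
        if unicodedata.is_normalized("NFD", piece):
            yield piece                                  # passed the check, reuse as is
        else:
            yield unicodedata.normalize("NFD", piece)    # normalize just this piece

if __name__ == "__main__":
    print([ascii(p) for p in incremental_nfd("r\u00e9sum\u00e9")])
```

Because `incremental_nfd` is a generator, a caller that exits early (for example a comparison that finds a difference in the first piece) never touches the rest of the string.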
Optimized Concatenation
- Simple concatenation of two normalized strings can yield a string that is not normalized
- One option is to normalize the result
  - Unnecessarily duplicates normalization
Optimized Concatenation: Example

[Figure: two pieces of "résumé" joined in two ways: (1) concatenate, then normalize the whole result; (2) find the boundaries, concatenate, and normalize only up to the boundaries]
- It is enough to normalize the boundary parts
- Incremental normalization is used
- Much faster than redoing the whole resulting string
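
The boundary idea can be sketched as follows for NFD, using Python's `unicodedata`. The function name `concat_nfd` is hypothetical, and the sketch assumes both inputs are already in NFD, so every code point with combining class 0 is a safe boundary. NFC needs a stricter boundary test and is not handled here.

```python
import unicodedata

def concat_nfd(left: str, right: str) -> str:
    """Concatenate two NFD strings, renormalizing only around the seam.

    Precondition (assumed, not checked): left and right are each in NFD.
    """
    # First boundary in right: the first code point with combining class 0.
    j = 0
    while j < len(right) and unicodedata.combining(right[j]) != 0:
        j += 1
    if j == 0:
        return left + right   # right starts at a boundary, nothing can reorder

    # Last boundary in left: the last code point with combining class 0.
    i = len(left)
    while i > 0:
        i -= 1
        if unicodedata.combining(left[i]) == 0:
            break

    # Only left[i:] + right[:j] can be out of canonical order.
    seam = unicodedata.normalize("NFD", left[i:] + right[:j])
    return left[:i] + seam + right[j:]

if __name__ == "__main__":
    # "a" + acute, followed by a string that starts with a dot below.
    print(ascii(concat_nfd("xa\u0301", "\u0323y")))
```

Everything before the last boundary of the left string and after the first boundary of the right string is copied without being looked at again.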
Accepting the FCD Form

- Fast Composed or Decomposed (FCD) form is a partially normalized form
- Not unique
- More lenient than NFD or NFC form
- It requires that the procedure supports all canonically equivalent strings on input
- It is possible to quick check the FCD form
FCD Form: Examples

SEQUENCE              FCD  NFC  NFD
A-ring                 Y    Y
Angstrom               Y
A + ring               Y         Y
A + grave              Y         Y
A-ring + grave         Y    Y
A + cedilla + ring     Y         Y
A + ring + cedilla
A-ring + cedilla            Y
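
The FCD condition can be checked with a short sketch in Python's `unicodedata`. It assumes the usual definition: for each pair of adjacent code points, the combining class at the end of the first one's canonical decomposition must not be greater than the non-zero class at the start of the second one's. The name `is_fcd` and the `ROWS` table are hypothetical, and calling `normalize` per code point stands in for the precomputed lead/trail class table a real library would use.

```python
import unicodedata

def is_fcd(text: str) -> bool:
    """True if decomposing each code point in place needs no reordering."""
    prev_trail = 0
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        lead = unicodedata.combining(decomposed[0])    # class of first code point
        trail = unicodedata.combining(decomposed[-1])  # class of last code point
        if lead != 0 and lead < prev_trail:
            return False
        prev_trail = trail
    return True

# The sequences from the examples table.
ROWS = {
    "A-ring": "\u00c5",
    "Angstrom": "\u212b",
    "A + ring": "A\u030a",
    "A + grave": "A\u0300",
    "A-ring + grave": "\u00c5\u0300",
    "A + cedilla + ring": "A\u0327\u030a",
    "A + ring + cedilla": "A\u030a\u0327",
    "A-ring + cedilla": "\u00c5\u0327",
}

if __name__ == "__main__":
    for name, seq in ROWS.items():
        flags = [is_fcd(seq),
                 unicodedata.normalize("NFC", seq) == seq,
                 unicodedata.normalize("NFD", seq) == seq]
        print(f"{name:20}", " ".join("Y" if f else "-" for f in flags))
```

Any string that is in NFC or NFD is not automatically FCD (see A-ring + cedilla), but text that is in NFD always is, and most precomposed text in practice passes as well.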
Canonical Closure
- Preprocessing data to support the FCD form
- Ensures that if data is assigned to a sequence (or a code point), it is also assigned to all canonically equivalent FCD sequences
Example: assigning A-ring (U+00C5) = X yields, after closure:
  A-ring (U+00C5) = X
  Angstrom sign (U+212B) = X
  A + combining ring above (U+0041 U+030A) = X
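
A deliberately simplified closure can be sketched in Python. It is not how ICU builds its data: the names `build_singles_index` and `close_over` are hypothetical, and the sketch only adds two kinds of equivalents for each key (the fully decomposed sequence and any single code point with the same decomposition). A full closure would also need the partially composed variants of longer sequences.

```python
import unicodedata
from collections import defaultdict

def build_singles_index():
    """Map each NFD sequence to the single code points that decompose to it."""
    index = defaultdict(set)
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:      # skip surrogates
            continue
        ch = chr(cp)
        nfd = unicodedata.normalize("NFD", ch)
        if nfd != ch:
            index[nfd].add(ch)
    return index

def close_over(mapping, index):
    """Copy each value to the equivalents of its key; existing entries win."""
    closed = dict(mapping)
    for key, value in mapping.items():
        nfd = unicodedata.normalize("NFD", key)
        for variant in {nfd} | index.get(nfd, set()):
            closed.setdefault(variant, value)
    return closed

if __name__ == "__main__":
    index = build_singles_index()        # done once, at build time
    closed = close_over({"\u00c5": "X"}, index)
    for key in sorted(closed):
        print(ascii(key), "=", closed[key])
```

Both kinds of added keys are FCD (a single code point and an NFD sequence always are), so a lookup on FCD input can find the value without normalizing first. The cost is the one-time scan and the larger table.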

Collation
- Locale-specific sorting of strings
- Relation between code points and collation elements
- Context sensitive:
  - Contractions: H < Z, but CZ < CH
  - Expansions: OE < Œ < OF
  - Both: a sequence can be a contraction and an expansion at once
- See "Collation in ICU" by Mark Davis
Collation Implementation in ICU

- Two modes of operation:
  - Normalization OFF: expects the users to pass in FCD strings
  - Normalization ON: accepts any strings
- Some locales require normalization to be turned on
- Canonical closure done for contractions and regular mappings
- Two important services:
  - Sort key generation
  - String compare function
- More about ICU at the end of the presentation
FCD Support in Collation

- Much higher performance
- Values assigned to a code point or a contraction are equal to those for its canonically equivalent FCD sequences
- This process is time consuming, but it is done at build time
- May increase the data size
Sort Key Generation

- Whole strings are processed
- Sort keys tend to get reused, so the emphasis is on producing sort keys that are as short as possible
- Two modes of operation:
  - Normalization ON: strings are quick checked and normalization is performed if required
  - Normalization OFF: depends on strings being in FCD form. The performance increases by 20% to 50%
String Compare
- Very time critical
- Result is usually determined before fully processing both strings
- First step is binary comparison for equality
- When it finds a difference, comparison continues from a safe spot:
  - Normal situation: no need to back up
  - Difference inside a contraction (for example "cz" vs. "ch"): must back up to the start of the contraction
  - Difference inside a combining sequence: must back up to the normalization safe spot
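
The "binary comparison first, then back up" step can be sketched in Python. This is not ICU's compare function: it ignores collation elements and contractions and simply orders strings by the code points of their NFD forms. The names `starts_with_starter` and `compare` are hypothetical. It shows only how much of the input can be skipped before normalization is needed.

```python
import unicodedata

def starts_with_starter(ch: str) -> bool:
    """True if ch has class 0 and its canonical decomposition starts with class 0."""
    if unicodedata.combining(ch) != 0:
        return False
    decomp = unicodedata.decomposition(ch)
    if decomp and not decomp.startswith("<"):
        return starts_with_starter(chr(int(decomp.split()[0], 16)))
    return True

def compare(a: str, b: str) -> int:
    """Return -1, 0 or 1, treating canonically equivalent strings as equal."""
    # Step 1: binary comparison; skip the identical prefix without normalizing.
    i, n = 0, min(len(a), len(b))
    while i < n and a[i] == b[i]:
        i += 1
    if i == len(a) == len(b):
        return 0

    # Step 2: back up until position i is a safe spot in both strings.
    def unsafe(s: str) -> bool:
        return i < len(s) and not starts_with_starter(s[i])
    while i > 0 and (unsafe(a) or unsafe(b)):
        i -= 1

    # Step 3: normalize and compare only the tails after the safe spot.
    tail_a = unicodedata.normalize("NFD", a[i:])
    tail_b = unicodedata.normalize("NFD", b[i:])
    return (tail_a > tail_b) - (tail_a < tail_b)

if __name__ == "__main__":
    print(compare("r\u00e9sum\u00e9", "re\u0301sume\u0301"))
    print(compare("a\u0301\u0323", "a\u0323\u0301"))
```

Step 3 normalizes the whole remaining tail for brevity; a time-critical implementation would instead continue piece by piece and stop at the first difference.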
String Compare Continued

- Normalization ON: incremental FCD check and incremental FCD normalization if required
- Normalization OFF: assumes that the source strings are FCD
- Most locales don't require normalization to be on and thus are 20% faster by using FCD
International Components for Unicode

- International Components for Unicode (ICU) is a library that provides robust and full-featured Unicode support
- The ICU normalization engine supports the optimizations mentioned here
- Library services accept FCD strings as input
- Wide variety of supported platforms
- Open source (X license, non-viral)
- C/C++ and Java versions
- http://oss.software.ibm.com/icu/
Conclusion
- The presented techniques allow much faster string processing
- In the case of collation, sort key generation gets up to 50% faster than when normalizing beforehand
- The string compare function becomes up to 3 times faster
- May increase data size
- Canonical closure preprocessing takes more time to build, but pays off at runtime
Q&A
Summary
- Introduction
- Avoiding normalization
- Check for normalized text
- Normalize incrementally
- Concatenation of normalized strings
- Accepting the FCD form
- Implementation of collation in ICU
