Professional Documents
Culture Documents
To The BMP and Beyond
To The BMP and Beyond
Eric Muller
Adobe Systems
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 1
Content
1. Why Unicode
2. Character model
3. Principles of the Abstract Character Set
4. The characters in 5.0
5. Development of the standard
6. Processing
7. Unicode and other standards
8. Resources
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 2
Part I
Why Unicode
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 3
ASCII
128 characters
= 41
supports meaningful exchange of text data
very limited: not even adequate for English:
Adobe®
he said “Hi!”
résumé†
cañon
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 4
Many other standards
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 5
Unicode
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 6
Part II
Character model
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 7
Four layers
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 8
Abstract character set
character:
the smallest component of written language that has semantic value
wide variation across scripts
alphabetic, syllabary, abugidas, abjad, logographic
even within scripts, e.g. “ch”:
two components in English
one component in Spanish?
abstract character:
a unit of information used for the organization, control, or
representation of textual data.
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 9
Coded character set
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 10
17 Planes
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 11
Private Use Area
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 12
Surrogate code points and scalar values
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 13
Character encoding forms: UTFs
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 14
UTF-8
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 15
UTF-16
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 16
UTF-32
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 17
Character encoding schemes
abstract character:
the letter of the Latin script
coded character:
name: LATIN CAPITAL LETTER A
code point: U+0041
encoding forms:
UTF-8: 41
UTF-16: 0041
UTF-32: 00000041
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 19
Character Hiragana MA
abstract character:
the letter of the Hiragana script
coded character:
name: HIRAGANA LETTER MA
code point: U+307E
encoding forms:
UTF-8: E3 81 BE
UTF-16: 307E
UTF-32: 0000307E
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 20
Character Deseret AY
abstract character:
the letter of the Deseret script
coded character:
name: DESERET CAPITAL LETTER AY
code point: U+1040C
encoding forms:
UTF-8: F0 90 90 8C
UTF-16: D801 DC0C
UTF-32: 0001040C
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 21
Terminology
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 22
Part III
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 23
Principles
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 24
Characters, not glyphs
/ / / / /
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 25
Plain text only
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 26
Unification, within each script, across languages
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 27
Well-defined semantics for characters
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 28
Dynamic composition
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 29
Dynamic composition (2)
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 30
Equivalence for precomposed forms
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 31
Characters are stored in logical order
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 32
Round-tripping with some other standards
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 33
Part IV
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 34
How many?
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 35
The scripts
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 36
The scripts (2)
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 37
The scripts (3)
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 38
Block descriptions
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 39
Code charts
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 40
Code charts (2)
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 41
Part V
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 42
Unicode Inc.
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 43
Technical committees
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 44
Standards
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 45
CLDR and LDML
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 46
The Unicode Standard
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 47
ISO/IEC 10646
defined by JTC1/SC2/WG2
aligned with Unicode, via cooperation
same repertoire
same character names
same code points
does not define properties or processing
current: ISO/IEC 10646:2003 + Amendments 1 and 2 + 5 characters
corresponds to Unicode 5.0
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 48
Part VI
Processing
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 49
Properties and algorithms
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 50
The Bidirectional Algorithm
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 51
The Bidirectional Algorithm (2)
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 52
The Bidirectional Algorithm (3)
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 53
Unicode Normalization Forms
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 54
Canonical Decomposition Property
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 55
NFD
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 56
NFC
transform to NFD
recompose combining sequences
R S T
decompose
U & S T NFD
reorder
U S & T
combine
V & T
combine
W T
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 57
NFKD, NFKC
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 58
Case mappings
simple mappings:
one to one
context independent
avoid: insufficient for e.g. ligatures, German
complex mappings:
one to many: X 22, or Y 4Z
contextual: [ \ in final position, not ]
local-sensitive: ^ _ in Turkish, not Z
case folding for caseless matches
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 59
Unicode Collation Algorithm
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 60
Part VII
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 61
Transcoding
Unicode
chars
UTF-8
enc. UTF-16
UTF-32
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 62
Transcoding
Unicode BAR
chars
UTF-8
enc. UTF-16 foo
UTF-32
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 63
Transcoding
Unicode BAR
chars
UTF-8
enc. UTF-16 foo
UTF-32
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 64
Transcoding
Unicode BAR
chars
UTF-8
enc. UTF-16 foo
UTF-32
foo
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 65
Transcoding
Unicode
chars
UTF-8
enc. UTF-16
UTF-32
foo
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 66
Transcoding
Unicode
chars
UTF-8 ASCII
enc. UTF-16 JIS
UTF-32 ISCII
foo ...
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 67
Character Hiragana MA (revisited)
abstract character:
the letter of the Hiragana script
coded character:
name: HIRAGANA LETTER MA
code point: U+307E
encoding forms:
UTF-8: E3 81 BE
UTF-16: 307E
UTF-32: 0000307E
ASCII: n/a
JIS 0208: 245E
KSX 1001: 2A5E
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 68
XML
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 69
Anatomy of an implementation
JIS
UTF-16
UTF-8
transcoding
layer
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 70
Anatomy of an implementation (2)
JIS
UTF-16 UTF-16
UTF-8
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 71
JIS X 0208:1997 and JIS X 0213:2000
standard:
Japan
two complementary standards
market will probably demand both
characters:
JIX X 0208: 7k
JIS X 0213: 4k
~300 in plane 2
mappings:
mapping to Unicode 3.2 does not use PUA
encoding:
2 bytes
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 72
GB 18030-2000
standard:
People’s Republic of China
mandatory for products sold there
characters:
28k
mappings:
mapping to Unicode 3.2, BMP only, uses PUA for 25 characters
mapping to Unicode 4.1: no PUA
encoding:
1, 2, or 4 bytes
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 73
HKSCS
standard:
Hong Kong SAR
supplements BIG5
mandatory for selling to the government
characters:
4,818 characters
1,651 in plane 2
mappings:
mapping to Unicode 3.0 uses PUA for 1,686 characters
mapping to Unicode 3.1-4.0 uses PUA for 35 characters
mapping to Unicode 4.1 does not use the PUA
encoding:
2 bytes
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 74
The bad news
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 75
Part VIII
Resources
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 76
Unicode Inc. Resources
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 77
Internationalization and Unicode Conference
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 78
Other Resources
© Adobe Systems - To the BMP and beyond! July 20, 2006 - Slide 79