Charsets Encodings Java
Brian Clapper
ArdenTex, Inc.
bmc@ardentex.com
Character Sets, Encodings, Java and Other Headaches
Introduction
ASCII
ISO 8859-1
Windows 1252
Unicode
Note, though, that the encodings of these non-ASCII characters are never the same
as in Windows 1252 or ISO 8859-1
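A minimal sketch of what this note implies, assuming it refers to Unicode encodings such as UTF-8 and UTF-16. The class name EncodingComparison and the helper hex are illustrative only. The sample character is é (U+00E9), which both single-byte character sets store as the byte E9:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingComparison {
    // Encode s in the given charset and render the resulting bytes as hex.
    public static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Written as a Unicode escape so the encoding of this source file does not matter.
        String eAcute = "\u00e9";

        System.out.println("ISO-8859-1:   " + hex(eAcute, StandardCharsets.ISO_8859_1)); // E9
        System.out.println("windows-1252: " + hex(eAcute, Charset.forName("windows-1252"))); // E9
        System.out.println("UTF-8:        " + hex(eAcute, StandardCharsets.UTF_8)); // C3 A9
        System.out.println("UTF-16BE:     " + hex(eAcute, StandardCharsets.UTF_16BE)); // 00 E9
    }
}
```

The code point is the same in every case; only the bytes differ. Reading the UTF-8 bytes C3 A9 as ISO 8859-1 yields the two characters Ã©, the familiar symptom of a mismatched encoding.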
Common Encodings
No byte-order mark
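As a hedged illustration, assuming this point describes the explicit-endianness encodings UTF-16BE and UTF-16LE, the sketch below compares them with Java's plain UTF-16 charset, which writes a big-endian byte-order mark when encoding. The class name ByteOrderMarkDemo and the helper hex are illustrative only:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteOrderMarkDemo {
    // Encode s in the given charset and render the resulting bytes as hex.
    public static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Java's "UTF-16" encoder emits the byte-order mark FE FF, then big-endian data.
        System.out.println("UTF-16:   " + hex("A", StandardCharsets.UTF_16));   // FE FF 00 41
        // The explicit-endianness variants emit no byte-order mark.
        System.out.println("UTF-16BE: " + hex("A", StandardCharsets.UTF_16BE)); // 00 41
        System.out.println("UTF-16LE: " + hex("A", StandardCharsets.UTF_16LE)); // 41 00
    }
}
```

Without a byte-order mark, the reader has to know the byte order in advance, because nothing in the data announces it.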
Pitfalls
* Whew *
With that out of the way, let's look at some
encoding-related pitfalls.
Examples (MySQL syntax):
CREATE DATABASE foo CHARACTER SET utf8;
CREATE TABLE tbl CHARACTER SET utf8;
CREATE TABLE tbl (
    col1 VARCHAR(5) CHARACTER SET latin1,
    col2 VARCHAR(20) CHARACTER SET utf8
);
See http://www.postgresql.org/docs/8.4/static/multibyte.html
Example:
CREATE DATABASE db WITH ENCODING 'UTF8'
$ createdb -E UTF8 db
Nightmare Scenario #1
Problem:
What actually happened:
Nightmare Scenario #2
What actually happened:
Lessons to Learn
Useful Links
http://www.unicode.org/
http://www.joelonsoftware.com/articles/Unicode.html
http://en.wikipedia.org/wiki/Unicode
http://en.wikipedia.org/wiki/UTF-16
http://en.wikipedia.org/wiki/UTF-8
Acknowledgments
The following individuals reviewed the original
draft(s) of this presentation and provided valuable
feedback:
Mark Chadwick
Matt Dymek
Tom Hjellming
Steve Sapovits
Drew Sudell
Jon Tulk
Thank you!