
How to be a CSI (encoding Crime Scene Investigator)

Tex Texin
Internationalization Architect
Yahoo Inc.

Objectives for Crime Scene Investigation

• Have some fun
• Prevent death by bullet points
• Introduce strategies for analyzing character corruption
• Compare TV Show and Software investigation of mutilated character corpses
• The latest version of the presentation is at www.i18nguy.com/How-To-Be-A-CSI.pdf
How to be a CSI (encoding Crime Scene Investigator) copyright © 2006 Tex Texin All rights reserved. 2

Our CSI Heroes

Our Software CSI Heroes

TV CSI vs. Software CSI

• Gil Grissom
• Catherine Willows
• Nick Stokes
• Sara Sidle
• Greg Sanders
• Warrick Brown
• Dr. Al Robbins
• Capt. Brass


30th Internationalization and Unicode Conference 1



Crime Discovery On TV

• First steps
• Close off crime scene and gather clues
  – Take lots of odd pictures
  – Collect potential forensic evidence
  – Identify the victim(s) (or what’s left of them)
  – Interview witnesses
    • They lie! They presume! They are prejudiced!
  – Put down either the jaded or newbie cop
    • If it’s suicide then how do you explain…

Crime Discovery in Software

A woman calls 911 yelling: “There has been a Mojibake! Someone committed Mojibake!”*

Characters have died. Decomposition has already set in. Who did it?

*Mojibake = Japanese for garbage characters.


Crime Discovery in Software

• Characters Corrupted
  – Wrong characters
  – White or black boxes
• Collect forensic evidence
  – Screen captures, Keystrokes, Environment
• Witnesses/Users: “It happens when I…”
  – “The entire application is busted”
  – They lie, They presume, They’re prejudiced
• The newbie or jaded developer
  – “That vendor always does that. Our code is fine.”

Collecting Images

• Often included in bug reports
  – Better than nothing
  – You can detect patterns
  – But often not helpful
• Better to collect:
  – Expected bytes/chars
  – Actual bytes/chars
  – Fonts referenced, available
  – Expected/detected encoding
  – Keystrokes
• A way to reproduce
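
The “Better to collect” items (expected and actual bytes/chars) can be gathered with a few lines of code instead of a screenshot. The sketch below is only an illustration: the function name `evidence_report` and the report layout are assumptions, not part of the presentation.

```python
import unicodedata

def evidence_report(expected: str, actual: str, encoding: str = "utf-8") -> str:
    """Format expected vs. actual text as code points and bytes for a bug report."""
    lines = []
    for label, text in (("expected", expected), ("actual", actual)):
        code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)
        # backslashreplace keeps the report printable even when a character
        # cannot be represented in the chosen encoding.
        raw = text.encode(encoding, errors="backslashreplace")
        names = ", ".join(unicodedata.name(ch, "<unnamed>") for ch in text)
        lines.append(f"{label:8} chars : {text!r}")
        lines.append(f"{label:8} cps   : {code_points}")
        lines.append(f"{label:8} bytes : {raw.hex(' ')} ({encoding})")
        lines.append(f"{label:8} names : {names}")
    return "\n".join(lines)

report = evidence_report("ç", "Ã§")
print(report)
```

Attaching this kind of dump to a bug report gives the investigator the byte-level evidence that an image cannot.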

Example Case History

• Developers refused to debug based on an urban legend:
  – “Crime happens. Can’t catch these guys.”
  – “It’s Microsoft IE’s fault. Our software is fine.”
• Instead of:
<meta http-equiv=Content-Type content="text/html; charset=utf-8">
• The code said:
<meta http-equiv=Content-Type content="text/html; charset=utf8">
(Note the missing hyphen in “utf8”)

Analysis On TV

• Autopsy: More data collection
• Possibilities: What could have happened?
• Conjecture: How could it have happened?
• Reconstruction: Does it fit the data?
• Research: Does other data fit the theory?
  – Always ask the most narrowing question last
• Experiments
  – Measurements and proof of concept
  – Decay rates under abnormal circumstances
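
The charset typo in the Example Case History (“utf8” vs. “utf-8”) is the kind of clue a small script can surface. This is a minimal sketch: the regular expression and the helper names `charset_label` and `check_label` are hypothetical, and the comparison is a plain string match against the label the team intended, not a model of any browser's behavior.

```python
import re

# Hypothetical pattern: captures the label that follows "charset=" in a meta tag.
CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE)

def charset_label(html: str):
    """Return the declared charset label, or None if none is declared."""
    match = CHARSET_RE.search(html)
    return match.group(1) if match else None

def check_label(html: str, expected: str = "utf-8") -> str:
    """Compare the declared label with the label the team intended to ship."""
    found = charset_label(html)
    if found is None:
        return "no charset label found"
    if found.lower() == expected.lower():
        return "ok"
    return f"suspect: declared {found!r}, expected {expected!r}"

good = '<meta http-equiv=Content-Type content="text/html; charset=utf-8">'
bad = '<meta http-equiv=Content-Type content="text/html; charset=utf8">'
print(check_label(good))
print(check_label(bad))
```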


Analysis on TV

[Diagram labels: Anatomy and decay science; Autopsy; Trajectory reconstruction; Experiments with controls]

Typical computer queries
• How many trucks on the road?
• How many on route 1?
• How many on route 1 at 3pm?
• How many stopped at Joe’s bar?

Analysis in Software

[Diagram labels: Core dumps, architecture and debugging; Data flow analysis and points of conversion; Known data injection]

Typical computer queries
• Names of encodings
• Relations between encodings
• Variations of encodings
• Quality of encoding conversions


Analysis In Software

• Architecture: The Web is complex
  – Clients, browsers, mixed technologies (DOM, java, css, php, active-x, applets, etc.), *ml, standard + non-standard behaviors, OS dependencies, etc.
  – Servers, protocols are also mixed
  – Language, locale, encoding are ambiguous
  – Negotiation tactics are unreliable
• But must be understood & kept up-to-date
  – See Web Internationalization Tutorial, et al.

Character Encoding Negotiation

[Diagram: a Unix user (GB2312) and a Windows user (1252) exchanging html; labels: HTML Encoding, CSS Encoding, JavaScript Encoding, DOM Encoding]



Architecture

• Data flow and points of (mis)conversion
  – Interfaces are common failure points
    • Between legacy, new and 3rd party software
    • API, web services, protocols, devices
    • Data sources, sinks
  – Types of misconversion
    • Missing conversion: Source ➽ Source
    • Incorrect conversion: Source ➽ X
    • Backwards conversion: Target ➽ Source
    • Double conversion: (Source ➽ Target) ➽ Target

Architecture

• Other contributors to conversion errors
  – Missing or incorrect encoding identifier
  – False detection
    • e.g. ASCII vs UTF-7
    • Too little data to detect
    • Text not like corpus used for detection statistics
  – Encoding name variations
    • “I didn’t know it was a crime. I just did what I was told!”
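
The four types of misconversion can be reproduced on a single character. The sketch below assumes ISO 8859-1 as the Source and UTF-8 as the Target, with cp437 standing in for the wrong encoding “X”; those choices and the variable names are illustrative, not taken from the slides.

```python
TEXT = "ç"
SOURCE, TARGET = "iso-8859-1", "utf-8"   # assumed encodings for this sketch

source_bytes = TEXT.encode(SOURCE)        # what the sender actually holds

def show(data: bytes, encoding: str = TARGET) -> str:
    """Decode for display the way a consumer expecting `encoding` would."""
    return data.decode(encoding, errors="replace")

# Correct conversion: Source -> Target
correct = source_bytes.decode(SOURCE).encode(TARGET)

# Missing conversion (Source -> Source): bytes passed through untouched
missing = source_bytes

# Incorrect conversion (Source -> X): converted, but to the wrong encoding
incorrect = source_bytes.decode(SOURCE).encode("cp437")

# Backwards conversion (Target -> Source): converter run in the wrong direction
backwards = source_bytes.decode(TARGET, errors="replace").encode(SOURCE, errors="replace")

# Double conversion ((Source -> Target) -> Target): converted twice
double = correct.decode(SOURCE).encode(TARGET)

for name, data in [("correct", correct), ("missing", missing),
                   ("incorrect", incorrect), ("backwards", backwards),
                   ("double", double)]:
    print(f"{name:9} {data.hex(' '):12} {show(data)!r}")
```

Each failure leaves a different byte pattern, which is what makes the later pattern matching possible.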


Encoding Name Variations

• E.g. ISO 8859-1 vs Windows-1252
  – ISO 8859-1: 80-9F are control codes
  – Windows-1252: 80-9F hold the 1252-specific characters
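
The difference between ISO 8859-1 and Windows-1252 in the 80-9F range can be listed directly. This is a sketch using Python's built-in codecs; the function name `c1_differences` is an assumption, and bytes that Python's cp1252 table leaves unassigned are reported as `None`.

```python
def c1_differences():
    """Decode each byte 0x80..0x9F under both encodings and collect the results."""
    rows = []
    for value in range(0x80, 0xA0):
        raw = bytes([value])
        latin1 = raw.decode("iso-8859-1")      # a control code in ISO 8859-1
        try:
            cp1252 = raw.decode("cp1252")      # usually a printable character
        except UnicodeDecodeError:
            cp1252 = None                      # unassigned in this codec's table
        rows.append((value, latin1, cp1252))
    return rows

for value, latin1, cp1252 in c1_differences():
    shown = "unassigned" if cp1252 is None else f"U+{ord(cp1252):04X} {cp1252}"
    print(f"0x{value:02X}  ISO 8859-1: U+{ord(latin1):04X}  Windows-1252: {shown}")
```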

Encoding Naming Variations

• Several variations of big-5, shift-jis, etc.
  – E.g. Microsoft, Apple added chars to shift-jis
• Then there is big5-hkscs
  – Hong Kong Supplementary Character Set
• Application dependent names
  – Windows IE expects “KSC5601” or “Korean”, not cp949, windows-949
  – Firefox would expect euc-kr
• Sometimes the error is font not encoding

Encoding Name Variations

• Poor naming conventions in the industry
• Application dependent
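
Application-dependent names are easy to check for one application at a time. As an illustration, the sketch below asks Python's own codec registry how it resolves the Korean labels mentioned above; the helper `resolve` is an assumption, and the answers describe Python only, not IE or Firefox.

```python
import codecs

def resolve(label: str):
    """Return the canonical codec name this Python build maps `label` to, or None."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None   # this application does not know the label

labels = ["KSC5601", "Korean", "euc-kr", "cp949", "windows-949"]
resolved = {label: resolve(label) for label in labels}

for label, name in resolved.items():
    print(f"{label:12} -> {name}")
```

Running the same list of labels against each component in the data flow shows where a name is accepted, remapped, or rejected.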

Encoding Conversions

• Correct conversion is not always obvious:
  – 0x5C: is it a file separator “\” or currency “¥”?
    • (or Won, or other currency)?
    • Half-width or Full-Width?
  – Sometimes you need different converters

Missing Character Conversions

• A character missing from the target encoding converts to “?” without an error code.
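
The silent “?” substitution is easy to reproduce, and just as easy to detect before it happens. The sketch below uses Python's `errors="replace"` mode to imitate a converter that reports no error; the function name `lossy_characters` and the sample string are assumptions.

```python
def lossy_characters(text: str, encoding: str) -> list:
    """Return the characters that `encoding` cannot represent."""
    lost = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            lost.append(ch)
    return lost

text = "Tokyo 東京 ¥100"

# A converter in "replace" mode substitutes "?" and raises nothing.
silent = text.encode("ascii", errors="replace")
print(silent)

# Checking first reveals which characters would have died.
print(lossy_characters(text, "ascii"))
```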

Encoding Conversions

• The family secret: Versioning
  – Unicode is an evolving standard
  – As characters are added, the opportunity for improved conversions exists
  – “Which Unicode version is the converter for?”
  – If I convert data today, how will it compare with the same data converted 3 years ago?
  – How long after death before rigor mortis sets in, and when do the maggots come?

Breaking Multibyte Characters

Caused by not treating all the bytes of a character as one unit
• Don’t insert bytes in the middle
• Don’t delete 1 byte of a multi-byte char.
• Be careful with block boundaries
• Don’t treat individual bytes as characters
  – E.g. uppercase a trailing byte
  – E.g. treat trailing byte 0x5C as “\” in file pathnames
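
Two of the “Breaking Multibyte Characters” warnings can be demonstrated in a few lines: block boundaries and the trailing byte 0x5C. This is a sketch under stated assumptions: the 4-byte block size is arbitrary, and the Shift-JIS character 表 is a commonly cited example chosen here, not one from the slides.

```python
import codecs

data = "日本語".encode("utf-8")                       # 3 bytes per character
blocks = [data[i:i + 4] for i in range(0, len(data), 4)]  # blocks cut characters apart

# Naive: decode each block on its own; characters cut at a boundary are damaged.
naive = "".join(block.decode("utf-8", errors="replace") for block in blocks)

# Safer: an incremental decoder carries partial characters across blocks.
decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
pieces = [decoder.decode(block) for block in blocks]
pieces.append(decoder.decode(b"", final=True))       # flush; raises if data is truncated
stitched = "".join(pieces)

print(repr(naive))
print(repr(stitched))

# Trailing byte 0x5C: the second byte of this Shift-JIS character equals "\".
sjis = "表".encode("shift_jis")
print(sjis.hex(" "), sjis.endswith(b"\\"))
```

Code that scans `sjis` byte by byte for a path separator would find one that is not there.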


Making Mojibake

Splitting multi-byte characters
Byte types: L = Lead Byte, T = Trail Byte, S = Single Byte

Chars:      日       本       語
Byte type:  L  T     L  T     L  T
Bytes:      93 FA    96 7B    8C EA

Inserting “a” (61) in second byte

Chars:      殿       坙       {     語
Byte type:  L  T     L  T     S     L  T
Bytes:      93 61    FA 96    7B    8C EA

Encoding Pathology and Forensics

• Piecing together who did what to whom
  – Knowing what was expected and what resulted, and comparing with typical patterns of failure, we can deduce what occurred
• Do all characters fail or certain ones?
• How do they fail? Wrong characters, question marks, black boxes, etc.
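
The “Making Mojibake” example can be replayed in code. The byte values on the slide are Shift-JIS; using Python's `cp932` codec (the Windows Shift-JIS variant) is an assumption made here, since the slide does not name a converter.

```python
# Shift-JIS bytes for the three characters, as on the slide.
original = "日本語".encode("cp932")
print(original.hex(" "))        # 93 fa 96 7b 8c ea

# Insert "a" (0x61) after the first lead byte.
damaged = original[:1] + b"a" + original[1:]
print(damaged.hex(" "))         # 93 61 fa 96 7b 8c ea

# Each lead byte now pairs with the wrong trail byte until the decoder resyncs.
decoded = damaged.decode("cp932", errors="replace")
print(decoded)
```

One inserted byte corrupts two characters and turns a former trail byte (7B) into a stray single-byte character; only the last character survives.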

Encoding Forensics and Pathology

• Question Marks
  – Characters converted to an encoding that doesn’t have them are replaced by “?”.
• All non-ASCII characters are mojibake
  – Missing, wrong, backwards, or 2x conversion
    • Expected “ç”. Actual: “Ã§”
    • Expected 0xE7 (ç). Actual: 0xC3 0xA7 (Ã§)
    • Note that U+00E7 “ç” in UTF-8 is 0xC3 0xA7
    • Conclusion: UTF-8 bytes displayed as ISO 8859-1

Encoding Forensics and Pathology

• Only certain characters misconverted
  – Variant encoding, out of date conversion, data flow, incomplete font
  – E.g. only euro, trademark, smart quotes, OE ligature,...
  – Used iso-8859-1 to utf-8 conversion instead of windows-1252 to utf-8
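
Both forensic patterns above can be confirmed experimentally. The sketch below reproduces the “ç” case and the euro case with Python's built-in codecs; the variable names are illustrative.

```python
# Pattern 1: all non-ASCII characters are mojibake.
expected = "ç"
utf8_bytes = expected.encode("utf-8")          # c3 a7
actual = utf8_bytes.decode("iso-8859-1")       # UTF-8 bytes displayed as ISO 8859-1
print(utf8_bytes.hex(" "), repr(actual))

# Pattern 2: only certain characters (euro, trademark, smart quotes) are wrong.
euro_1252 = "€".encode("cp1252")               # 80 in Windows-1252
wrong = euro_1252.decode("iso-8859-1").encode("utf-8")   # treated as ISO 8859-1
right = euro_1252.decode("cp1252").encode("utf-8")       # treated as Windows-1252
print(wrong.hex(" "), right.hex(" "))
```

In the second pattern the wrong conversion yields a control code rather than a euro sign, while characters outside 80-9F convert identically either way, which is why only a handful of characters fail.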

Encoding Error Patterns

• Comparing conversion error patterns is like identifying the gun that fired a bullet by the shared striation marks
• It’s encoding DNA
• Often there are obvious candidate patterns
  – “Round up all the usual suspects”
  – A stoolie (er… tool) for testing conversions
    • Given an input string, try the usual and expected error conversions. See if any of the results match
  – Other tools: Encoding validator, byte to %hh

Encoding Research

• How many trucks stopped at Joe’s bar?
• Look for variants you might be incurring
  – www.iana.org/assignments/character-sets
  – Conversions supported
    • See iconv, ICU, or doc for your converter
  – Check for font versions/updates
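
A “stoolie” for testing conversions can be very small. The sketch below is one possible shape, not the author's tool: given the expected and actual strings, it tries each ordered pair from an assumed list of suspect encodings and reports the pairs that reproduce the damage. The list `SUSPECTS` and the function name are hypothetical.

```python
from itertools import permutations

# Assumed line-up; extend it with the encodings used in your own data flow.
SUSPECTS = ["utf-8", "iso-8859-1", "cp1252", "shift_jis", "euc_jp"]

def round_up_suspects(expected: str, actual: str, encodings=SUSPECTS) -> list:
    """Return (written_as, read_as) pairs that turn `expected` into `actual`."""
    matches = []
    for written_as, read_as in permutations(encodings, 2):
        try:
            garbled = expected.encode(written_as).decode(read_as)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue   # this pair cannot produce any text, so it is cleared
        if garbled == actual:
            matches.append((written_as, read_as))
    return matches

matches = round_up_suspects("ç", "Ã§")
print(matches)
```

More than one pair may match, because related encodings agree on many bytes; testing further characters narrows the list.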

Data Injection

• Testing with live data feeds
  – Like being on a stakeout: hoping the criminal will come by while you wait
• Create a pseudo-feed or simulate data entry for “controlled experiments”.
  – Known, repeatable values simplify debugging.
    • Less threatening for debugging foreign languages
  – Inject directly into subsystems to eliminate other components as “suspects”.

Setting Traps for Criminals

• Use validation routines at key points of data acceptance or conversion
  – Especially useful for UTF-8, which has illegal byte patterns.
  – Reference: www.w3.org/International/questions/qa-forms-utf-8.html
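
Validation traps and known-data injection combine naturally. The sketch below is an illustration under stated assumptions: it relies on Python's strict UTF-8 decoder rather than the regular expression in the W3C reference, and `faulty_stage` is a made-up subsystem standing in for whatever component is under suspicion.

```python
def first_invalid_utf8(data: bytes):
    """Return the offset of the first illegal UTF-8 sequence, or None if valid."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        return exc.start
    return None

# Known, repeatable values for controlled experiments (assumed samples).
KNOWN_VALUES = ["plain ascii", "ç € ™", "日本語"]

def inject(stage, values=KNOWN_VALUES) -> list:
    """Feed known UTF-8 data through `stage` (bytes -> bytes); list damaged values."""
    damaged = []
    for value in values:
        out = stage(value.encode("utf-8"))
        if first_invalid_utf8(out) is not None or out.decode("utf-8") != value:
            damaged.append(value)
    return damaged

def faulty_stage(data: bytes) -> bytes:
    """Hypothetical suspect: wrongly assumes its input is ISO 8859-1."""
    return data.decode("iso-8859-1").encode("utf-8")

print(first_invalid_utf8(b"abc\xc0\xaf"))   # illegal overlong sequence
print(inject(faulty_stage))
print(inject(lambda data: data))            # a clean pass-through stage
```

Note that the faulty stage emits valid UTF-8, so validation alone would not catch it; comparing against the known input does.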

Conclusions: How to be a CSI

• Gather forensic evidence
• Be objective, be thorough
• Understand the architecture, the data flow, the points of conversion
• Identify the potential patterns of failure
• “Concentrate on what cannot lie. The evidence” -Gil Grissom

Thank You!

¿Questions?

Note: Images from the CSI TV show and those of CSI cast members are courtesy of CBS Broadcasting Inc. All Rights Reserved, and/or the actual copyright owners.
The owner and copyright for these photos were generally not documented from the sites I downloaded them. If notified I will update this file to include the correct copyright.
