Regular Expressions

Seek and You Shall Find: Using Regular Expressions for Fast, Accurate Mobile Device Data Searches
By Michael Harrington, CFCE, EnCE
Article Posted: May 07, 2010

In the world of digital forensics, the power to seek and find is key. The faster and more accurate the search, the faster you can zero in on your target and find the evidence you need to convict, prevent, or locate. Regular expressions, long thought of as the arcane art of long-haired network admins squirreled away in front of terminals with blinking Bash shell cursors, made their way into mainstream forensics with the GREP function in EnCase. (Though the term GREP has become synonymous with regex in the forensic community, much as Kleenex is synonymous with tissue, grep is actually a Linux/Unix program: a regular expression search utility. The grep program, and variants such as egrep, process regex patterns and return the results.) Adopted by many commercial forensic suites such as FTK, Cellebrite's Physical Pro, and MSAB's XRY Complete, these deceptively simple yet devilishly complex character patterns hold the key to a powerful process of searching and reporting.

Regular Expressions: What They Are and Why You Need Them

Forensic examiners, looking at a complex regular expression such as the one below:

\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b

might find themselves wondering what the heck they got into and why they bothered to come in to work that day. However, the effective use of regular expressions might make the difference in solving a case. That is because regular expressions automate and streamline tasks that would otherwise take hours, if not days. For example, when run as a case-insensitive search, the above regular expression matches virtually ALL e-mail addresses. Yes, virtually ALL of them; the notable misses are addresses whose top-level domain is longer than four letters. Folded into programming code, then, the expression can greatly enhance the code's efficacy, saving investigative time.

What is a regular expression?

Regular expressions (or the much easier to write regex) are patterns that describe a certain amount of text.
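To see the e-mail pattern above at work, here is a minimal sketch in Python using the standard re module. The sample text and addresses are invented for illustration, and the re.IGNORECASE flag is an assumption about how the pattern is meant to be run, since it lists only upper-case letters.

```python
import re

# The e-mail pattern from the article. It lists only A-Z, so the search
# is run case-insensitively (an assumption, not stated in the pattern itself).
EMAIL = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b", re.IGNORECASE)

# Invented sample text standing in for data extracted from a device.
sample = "Contact alice@example.com or Bob.Smith@mail.example.org before noon."

# findall returns every non-overlapping match, left to right.
found = EMAIL.findall(sample)
print(found)
```

Forensic suites expose the same idea through their own search dialogs; the Python call here is only a convenient way to try a pattern before using it in a tool.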
The name regular expression is derived from computer science and mathematical theory, where the expressions reflect a trait called regularity. However, while earlier regular expressions were mathematical in nature, modern regex are not, and the forensic examiner needs to know how to craft the expressions and how to invoke the software that will search with them.

Back to the example regular expression:

\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b

The first two characters, \b, represent an anchor that starts the match at a word boundary. The next set of characters, [A-Z0-9._%+-]+, tells the regex engine to match, one or more times (the + at the end of the series), any character that is a letter in the range A-Z, a digit in the range 0-9, or one of the characters . _ % + and -. The next character in the expression matches the @ character literally. The following sequence, [A-Z0-9.-]+, works the same way with a smaller set of characters: letters, digits, the dot, and the hyphen, again one or more times.
The two characters after that, \., are formed from the escape character \ and the dot (which, unless escaped, has the special wildcard meaning of any character). The result of this sequence is to match the dot literally. The next-to-last sequence, [A-Z]{2,4}, asks the engine to match a character from A-Z at least two and at most four times, taking as many as it can. The last two characters, \b, again anchor the match at a word boundary. If this regex gives you headaches, not to worry: the rest of this article walks you through the basics to give you the foundation you need to begin crafting your own magical regular expressions.

Character Classes

One important basic concept to grasp is working with character classes, or sets. A character class matches exactly one character out of a choice of several. An example: searching for variations between British and American English spellings. For instance, the British spell the wheel covering of an automobile as tyre, while Americans spell it as tire. To find the word regardless of whether the UK or US spelling was used, we can use the character class [yi]. It is applied by inserting the class between the letters the two spellings share, t[yi]re, so that it matches either spelling. It would not, however, match tyyre or tiire or any other misspelling. The character order inside the class does not matter, i.e. [yi] is equivalent to [iy].

Regular expressions have a handy wild card for matching any single character: the dot. To match our variations of tire above, we could use the dot as in t.re, where the dot stands in for either the i or the y, although it would also match unintended strings such as tore. While the dot matches any single character, it does not match line breaks such as carriage returns or newlines. However, depending on the regular expression engine you are using, you may be able to activate a special mode that allows the dot to match line breaks as well.
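The difference between a character class and the dot can be checked with a short Python sketch. The word list is invented for illustration. The class is written as [yi], with no space between the letters, because a space inside the brackets would itself become a member of the class. re.DOTALL is Python's name for the special mode in which the dot also matches a newline; other engines name it differently.

```python
import re

# Invented test words: two valid spellings, two misspellings, one unrelated word.
words = ["tire", "tyre", "tyyre", "tiire", "tore"]

# [yi] matches exactly one character, which must be y or i.
char_class = re.compile(r"t[yi]re")
# The dot matches any single character except a newline.
wildcard = re.compile(r"t.re")

class_hits = [w for w in words if char_class.fullmatch(w)]
dot_hits = [w for w in words if wildcard.fullmatch(w)]
print(class_hits)  # only the two real spellings
print(dot_hits)    # the dot also lets "tore" through

# By default the dot refuses a newline; with re.DOTALL it accepts one.
default_newline = re.fullmatch(r"t.re", "t\nre")
dotall_newline = re.fullmatch(r"t.re", "t\nre", re.DOTALL)
print(default_newline is None, dotall_newline is not None)
```

The class is the tighter choice whenever the set of acceptable characters is known, since the dot trades precision for convenience.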
To match a position rather than a character, regular expressions use anchors. Anchors can be used to match the beginning of a string (^) or the end of a string ($).

Searching for a Range of Characters

Sometimes an examiner may want to search for a range of characters, such as the digits in a telephone number or IMSI. To accomplish this, we place a hyphen between the first and last characters of the range inside the character class. To search for a single digit one would use [0-9]. You can also use more than one range within the character class; to match a single hexadecimal digit you would use [0-9a-fA-F]. Note that this character class includes both the range a-f and the range A-F in order to account for the possibility that the hex digit is written in lower or upper case.

You can use curly braces to specify exactly how many times a match should be repeated. For instance, to match a Nokia IMEI, you can use the following regular expression, in which \x30 through \x39 are the hexadecimal codes for the digits 0 through 9:

[\x30-\x39]{6}/[\x30-\x39]{2}/[\x30-\x39]{6}/[\x30-\x39]{2}

Special Characters

A number of characters (11 to be exact) have special meaning in regular expressions. The following list details those special characters:

Opening square bracket [
Backslash \
Caret ^
Dollar sign $
Period or dot .
Vertical bar or pipe symbol |
Question mark ?
Asterisk or star *
Plus sign +
Opening and closing round brackets ( )

You may see these characters referred to as metacharacters. What is their relevance?

Within the character class, if the caret ^ is used immediately after the opening bracket, it negates the character class. This means that the regex engine will match any character that is not listed in the class. Z[^mbie] will match Zo in Zombie, but it will not match the z in cheez, as there is nothing after the z for the engine to match.

The simplest regular expression is the single literal character, for example the letter z. Using this as a regex will match the first occurrence of that character in a string. For instance, if the string to be searched is The Zentor Corporation created the zombie plague, the letter Z will be matched in Zentor (provided we make sure the search is case insensitive). The regex can also be used to find the second occurrence, the z in zombie, but the engine has to be told explicitly to look for these additional matches.

It is often useful to repeat a token (character) in a regular expression. The question mark token ? makes the character that precedes it optional. For instance, the regular expression zombies? will match zombie or zombies. The asterisk (*) makes the regular expression engine look for the character preceding it zero or more times. A plus sign (+) will attempt to match the character before it one or more times. For instance, using the asterisk in the expression \xFF\xD8\xFF\xE0.*\xFF\xD9 tells the engine to accept any character, zero or more times, between the opening and closing hexadecimal values, which accounts for the variable length of JPEG files. However, this expression can be improved. Using the plus sign, as in \xFF\xD8\xFF\xE0.+\xFF\xD9, tells the engine to match the wild card one or more times, which eliminates a sequence where "\xFF\xD8\xFF\xE0" is followed immediately by "\xFF\xD9" with no intervening characters.
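The repetition and negation rules described above can be tried out in a few lines of Python. This is only a sketch: the sample strings, the IMEI value, and the HEAD and FOOT markers (plain-text stand-ins for a file header and footer) are invented for illustration, and the IMEI pattern is written without any space after the first slash.

```python
import re

# ? makes the preceding character optional: both spellings are found.
optional_hits = re.findall(r"zombies?", "one zombie, two zombies")

# * allows zero characters between the markers, + demands at least one.
# HEAD and FOOT are hypothetical stand-ins for header and footer bytes.
star_empty = re.search(r"HEAD.*FOOT", "HEADFOOT") is not None
plus_empty = re.search(r"HEAD.+FOOT", "HEADFOOT") is not None
plus_filled = re.search(r"HEAD.+FOOT", "HEAD data FOOT") is not None

# A caret right after the opening bracket negates the class.
negated = re.search(r"Z[^mbie]", "Zombie")
at_end = re.search(r"z[^mbie]", "cheez")  # nothing follows the z

# {n} repeats the preceding class exactly n times.
# \x30-\x39 is the range of digit characters 0 through 9.
IMEI = re.compile(r"[\x30-\x39]{6}/[\x30-\x39]{2}/[\x30-\x39]{6}/[\x30-\x39]{2}")
imei_ok = IMEI.fullmatch("123456/78/901234/56") is not None   # invented value
imei_short = IMEI.fullmatch("12345/78/901234/56") is not None  # first group too short

print(optional_hits, star_empty, plus_empty, plus_filled)
print(negated.group(), at_end, imei_ok, imei_short)
```

When searching raw binary data rather than text, Python expects a bytes pattern (for example one written with the rb prefix) and a bytes subject; the text versions here keep the sketch readable.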
Sometimes in regular expressions you want to provide for more than one possible match. For instance, you want to be able to search for a contact in the address book who may be listed as walrus OR Paul. To accomplish this, place the pipe symbol | between the two alternatives. In the phrase The Walrus was Paul, a case-insensitive search with the regular expression walrus|Paul will first match Walrus. If the search is continued from that point, it will match Paul. You can stack alternatives within your regular expressions, e.g. walrus|Paul|egg|lucy|goo.

If you want to use a special character in a regular expression as its literal character, you have to do what is called escaping the character. This is accomplished by placing the backslash \ in front of the metacharacter, which causes the regex engine to take it as a literal character without its special meaning.

That wasn't so bad, was it? Now you should have a bit of the magical regular expression pixie dust to sprinkle about your mobile forensic examinations. Though we covered only the barest minimum, you should now have a foundation on which to further your understanding of the powerful regex. Regular expressions will help you dig deeper into data, faster and more efficiently. Mastering them is well worth the effort.

References

Goyvaerts, Jan and Levithan, Steven. Regular Expressions Cookbook. Sebastopol: O'Reilly Media Inc., 2009.
Goyvaerts, Jan. Regular-Expressions.info - Regex Tutorial, Examples and Reference - Regexp Patterns. 15 April 2010. Jan Goyvaerts. 27 April 2010. www.regular-expressions.info
Michael Harrington, CFCE, EnCE, is an independent training and consulting contractor. He has taught mobile forensics around the globe and will be teaching regular expressions, along with other mobile device data analysis techniques, for Teel Technologies at Techno Security 2010. Teel Technologies, 16 Knight Street, Norwalk, Connecticut 06851; (203) 855-5389; me@themichaelharrington.com; www.teeltech.com.
