UnixPowerTools 3rd

Table of Contents
Chapter 32. Regular Expressions (Pattern Matching)

Section 32.1. That's an Expression
Section 32.2. Don't Confuse Regular Expressions with Wildcards
Section 32.3. Understanding Expressions
Section 32.4. Using Metacharacters in Regular Expressions
Section 32.5. Regular Expressions: The Anchor Characters ^ and $
Section 32.6. Regular Expressions: Matching a Character with a Character Set
Section 32.7. Regular Expressions: Match Any Character with . (Dot)
Section 32.8. Regular Expressions: Specifying a Range of Characters with [...]
Section 32.9. Regular Expressions: Exceptions in a Character Set
Section 32.10. Regular Expressions: Repeating Character Sets with *
Section 32.11. Regular Expressions: Matching a Specific Number of Sets with \{ and \}
Section 32.12. Regular Expressions: Matching Words with \< and \>
Section 32.13. Regular Expressions: Remembering Patterns with \(, \), and \1
Section 32.14. Regular Expressions: Potential Problems
Section 32.15. Extended Regular Expressions
Section 32.16. Getting Regular Expressions Right
Section 32.17. Just What Does a Regular Expression Match?
Section 32.18. Limiting the Extent of a Match
Section 32.19. I Never Meta Character I Didn't Like
Section 32.20. Valid Metacharacters for Different Unix Programs
Section 32.21. Pattern Matching Quick Reference with Examples
Chapter 32. Regular Expressions (Pattern Matching). Unix Power Tools, Third Edition, ISBN: 0-596-00330-7
Copyright 2002 O'Reilly Media, Inc. All rights reserved.
Regular Expressions (Pattern Matching)
32.1 That's an Expression

When my young daughter is struggling to understand the meaning of an idiomatic expression, such as "Someone let the cat out of the bag," before I tell her what it means, I have to tell her that it's an expression, that she's not to interpret it literally. (As a consequence, she also uses "That's just an expression" to qualify her own remarks, especially when she is unsure about what she has just said.) An expression, even in computer terminology, is not something to be interpreted literally. It is something that needs to be evaluated.

Many Unix programs use a special "regular expression syntax" for specifying what you could think of as "wildcard searches" through files. Regular expressions describe patterns, or sequences of characters, without necessarily specifying the characters literally. You'll also hear this process referred to as "pattern matching."

In this chapter, we depart a bit from the usual "tips and tricks" style of the book to provide an extended tutorial about regular expressions that starts in article 32.4. We did this because regular expressions are so important to many of the tips and tricks elsewhere in the book, and we wanted to make sure that they are covered thoroughly. This tutorial article is accompanied by a few snippets of advice (articles 32.16 and 32.18) and a few tools that help you see what your expressions are matching (article 32.17). There's also a quick reference (article 32.21) for those of you who just need a refresher.

For tips, tricks, and tools that rely on an understanding of regular expression syntax, you have only to look at:

  Chapter 13, Searching Through Files
  Chapter 17, vi Tips and Tricks
  Chapter 20, Batch Editing
  Chapter 34, The sed Stream Editor
  Chapter 41, Perl

O'Reilly's Mastering Regular Expressions, by Jeffrey Friedl, is a gold mine of examples and specifics. DD and TOR
32.2 Don't Confuse Regular Expressions with Wildcards

Before we even start talking about regular expressions, a word of caution for beginners: regular expressions can be confusing because they look a lot like the file-matching patterns ("wildcards") the shell uses. Both the shell and programs that use regular expressions have special meanings for the asterisk (*), question mark (?), parentheses (()), square brackets ([]), and vertical bar (|, the "pipe"). Some of these characters even act the same way, almost. Just remember, the shells, find, and some others generally use filename-matching patterns and not regular expressions.*

You also have to remember that shell wildcards are expanded before the shell passes the arguments to the program. To prevent this expansion, the special characters in a regular expression must be quoted (27.12) when passed as an argument from the shell. The command:

$ grep [A-Z]*.c chap[12]

could, for example, be interpreted by the shell as:

grep Array.c Bug.c Comp.c chap1 chap2

and so grep would then try to find the pattern "Array.c" in files Bug.c, Comp.c, chap1, and chap2. The simplest solution in most cases is to surround the regular expression with single quotes ('). Another is to use the echo command to echo your command line to see how the shell will interpret the special characters. BB and DG, TOR
* Recent versions of many programs, including find, now support regex via special command-line options. For example, find on my Linux server supports the -regex and -iregex options, for specifying filenames via a regular expression, case-sensitive and -insensitive, respectively. But the find command on my OS X laptop does not. SJC
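
To see the difference quoting makes, here is a small sketch that can be run in a scratch directory. The file names simply mirror the example above; the directory created by mktemp and the variable names are assumptions made only for illustration.

```shell
# Work in a throwaway directory so the wildcards have something to match.
LC_ALL=C; export LC_ALL
demo=$(mktemp -d) && cd "$demo" || exit 1
touch Array.c Bug.c Comp.c chap1 chap2

# Unquoted: the shell replaces both wildcards with matching filenames,
# so grep would never see the pattern at all.
unquoted=$(echo grep [A-Z]*.c chap[12])
echo "$unquoted"

# Quoted: the pattern is passed through untouched; only chap[12] expands.
quoted=$(echo grep '[A-Z]*.c' chap[12])
echo "$quoted"
```

The first echo shows the pattern replaced by a list of filenames; the second shows it reaching grep intact, which is what you almost always want.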
32.3 Understanding Expressions

You are probably familiar with the kinds of expressions that a calculator interprets. Look at the following arithmetic expression:
2 + 4
"Two plus four" consists of several constants or literal values and an operator. A calculator program must recognize, for instance, that 2 is a numeric constant and that the plus sign represents an operator, not to be interpreted as the + character.

An expression tells the computer how to produce a result. Although it is the sum of "two plus four" that we really want, we don't simply tell the computer to return a six. We instruct the computer to evaluate the expression and return a value. An expression can be more complicated than 2+4; in fact, it might consist of multiple simple expressions, such as the following:
2 + 3 * 4
A calculator normally evaluates an expression from left to right. However, certain operators have precedence over others: that is, they will be performed first. Thus, the above expression evaluates to 14 and not 20 because multiplication takes precedence over addition. Precedence can be overridden by placing the simple expression in parentheses. Thus, (2+3)*4 or "the sum of two plus three times four" evaluates to 20. The parentheses are symbols that instruct the calculator to change the order in which the expression is evaluated.

A regular expression, by contrast, is descriptive of a pattern or sequence of characters. Concatenation is the basic operation implied in every regular expression. That is, a pattern matches adjacent characters. Look at the following example of a regular expression:
ABE
Each literal character is a regular expression that matches only that single character. This expression describes "an A followed by a B followed by an E" or simply the string ABE. The term "string" means each character concatenated to the one preceding it. That a regular expression describes a sequence of characters can't be emphasized enough. (Novice users are inclined to think in higher-level units such as words, and not individual characters.) Regular expressions are case-sensitive; A does not match a.

Programs such as grep (13.2) that accept regular expressions must first evaluate the syntax of the regular expression to produce a pattern. They then read the input, line by line, trying to match the pattern. An input line is a string, and to see if a string matches the pattern, a program compares the first character in the string to the first character of the pattern. If there is a match, it compares the second character in the string to the second character of the pattern. Whenever it fails to make a match, it compares the next character in the string to the first character of the pattern. Figure 32-1 illustrates this process, trying to match the pattern abe on an input line.
Input line (string of characters):  The canister must be labeled.
Pattern (the string abe):           abe

The pattern is compared, character by character, to the input line.

  The
  abe
      In this example there is no match between the first character of the input line and the first character of the pattern. Since it failed to match, the next character of the input line is compared to the first character of the pattern.

  canister
   abe
      The first match between a string character on the input line and the first character of the pattern occurs in the word canister. Since there is a match, the second character in the pattern is compared to the next character in the input line.

  canister
   abe
      The second character in the pattern does not match the next character in the input line. So, returning to the first character in the pattern, the comparison is made to the next character in the input line. There is no match, so the process starts over.

  labeled
   abe
      The next match of the first character of the pattern occurs in the word labeled.

  labeled
   abe
      Since there is a match, the second character in the pattern is compared to the next character in the input line. In this case there is a match.

  labeled
   abe
      Now the third character in the pattern is compared to the next character in the input line. This is also a match. So, the input line matches the pattern.

Figure 32-1. Interpreting a regular expression
A regular expression is not limited to literal characters. There is, for instance, a metacharacter, the dot (.), that can be used as a "wildcard" to match any single character. You can think of this wildcard as analogous to a blank tile in Scrabble, where it means any letter. Thus, we can specify the regular expression A.E, and it will match ACE, ABE, and ALE. It matches any character in the position following A.

The metacharacter * (the asterisk) is used to match zero or more occurrences of the preceding regular expression, which typically is a single character. You may be familiar with * as a shell metacharacter, where it also means "zero or more characters." But that meaning is very different from * in a regular expression. By itself, the metacharacter * does not match anything in a regular expression; it modifies what goes before it. The regular expression .* matches any number of characters. The regular expression A.*E matches any string that matches A.E, but it also matches any number of characters between A and E: AIRPLANE, A FINE, AE, A 34-cent S.A.S.E, or A LONG WAY HOME, for example.
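
A quick way to check these claims is to feed grep a few sample lines. This is only a sketch: the word list is made up, and grep's -x option (match the whole line) is used so that each count reflects whole-line matches rather than matches buried inside a longer line.

```shell
# Sample lines, one candidate string per line (invented for this demo).
words='ACE
ABE
ALE
AE
AIRPLANE
A LONG WAY HOME
APPLY'

# A.E needs exactly one character between A and E.
dot=$(printf '%s\n' "$words" | grep -c -x 'A.E')
echo "A.E  matches $dot lines"     # ACE, ABE, ALE

# A.*E allows any number of characters, including none, in between.
star=$(printf '%s\n' "$words" | grep -c -x 'A.*E')
echo "A.*E matches $star lines"    # the three above plus AE, AIRPLANE, A LONG WAY HOME
```

APPLY is rejected by both patterns because it never reaches an E.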
If you understand the difference between . and * in regular expressions, you already know about the two basic types of metacharacters: those that can be evaluated to a single character, and those that modify how the characters that precede them are evaluated. It should also be apparent that by use of metacharacters you can expand or limit the possible matches. You have more control over what is matched and what is not. In articles 32.4 and after, Bruce Barnett explains in detail how to use regular expression metacharacters. DD
32.4 Using Metacharacters in Regular Expressions

There are three important parts to a regular expression:

Anchors
    Specify the position of the pattern in relation to a line of text.
Character sets
    Match one or more characters in a single position.
Modifiers
    Specify how many times the previous character set is repeated.

The following regular expression demonstrates all three parts:
^#*
The caret (^) is an anchor that indicates the beginning of the line. The hash mark is a simple character set that matches the single character #. The asterisk (*) is a modifier. In a regular expression, it specifies that the previous character set can appear any number of times, including zero. As you will see shortly, this is a useless regular expression (except for demonstrating the syntax!).

There are two main types of regular expressions: simple (also known as basic) regular expressions and extended regular expressions. (As we'll see in the next dozen articles, the boundaries between the two types have become blurred as regular expressions have evolved.) A few utilities like awk and egrep use the extended regular expression. Most use the simple regular expression. From now on, if I talk about a "regular expression" (without specifying simple or extended), I am describing a feature common to both types. For the most part, though, when using modern tools, you'll find that extended regular expressions are the rule rather than the exception; it all depends on who wrote the version of the tool you're using and when, and whether it made sense to worry about supporting extended regular expressions.
[The situation is complicated by the fact that simple regular expressions have evolved over time, so there are versions of simple regular expressions that support extensions missing from extended regular expressions! Bruce explains the incompatibility at the end of article 32.15. TOR]

The next eleven articles cover metacharacters and regular expressions:

  The anchor characters ^ and $ (article 32.5)
  Matching a character with a character set (article 32.6)
  Match any character with . (dot) (article 32.7)
  Specifying a range of characters with [...] (article 32.8)
  Exceptions in a character set (article 32.9)
  Repeating character sets with * (article 32.10)
  Matching a specific number of sets with \{ and \} (article 32.11)
  Matching words with \< and \> (article 32.12)
  Remembering patterns with \(, \), and \1 (article 32.13)
  Potential problems (article 32.14)
  Extended regular expressions (article 32.15)

BB
32.5 Regular Expressions: The Anchor Characters ^ and $

Most Unix text facilities are line-oriented. Searching for patterns that span several lines is not easy to do. [But it is possible (13.9, 13.10). JP] You see, the end-of-line character is not included in the block of text that is searched. It is a separator, and regular expressions examine the text between the separators. If you want to search for a pattern that is at one end or the other, you use anchors. The caret (^) is the starting anchor, and the dollar sign ($) is the end anchor. The regular expression ^A will match all lines that start with an uppercase A. The expression A$ will match all lines that end with uppercase A.

If the anchor characters are not used at the proper end of the pattern, they no longer act as anchors. That is, the ^ is an anchor only if it is the first character in a regular expression. The $ is an anchor only if it is the last character. The expression $1 does not have an anchor. Neither does 1^. If you need to match a ^ at the beginning of the line or a $ at the end of a line, you must escape the special character by typing a backslash (\) before it. Table 32-1 has a summary.
Table 32-1. Regular expression anchor character examples

Pattern   Matches
^A        An A at the beginning of a line
A$        An A at the end of a line
A         An A anywhere on a line
$A        A $A anywhere on a line
^\^       A ^ at the beginning of a line
^^        Same as ^\^
\$$       A $ at the end of a line
$$        Same as \$$ (a)

(a) Beware! If your regular expression isn't properly quoted, this means "process ID of current process." Always quote your expressions properly.
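
Several rows of the table can be checked directly with grep. The sample lines below are invented for this sketch; the single quotes around each pattern keep the shell from touching the $ and ^ characters.

```shell
# Five made-up lines, each chosen to exercise one row of Table 32-1.
lines='Apple
BANANA
An A here
$A costs $1
^caret first'

start=$(printf '%s\n' "$lines" | grep -c '^A')      # A at the beginning of a line
end=$(printf '%s\n' "$lines" | grep -c 'A$')        # A at the end of a line
literal=$(printf '%s\n' "$lines" | grep -c '$A')    # $ is not last, so it is literal
caret=$(printf '%s\n' "$lines" | grep -c '^\^')     # escaped ^ after the anchor

echo "start=$start end=$end literal=$literal caret=$caret"
```

Only the first two patterns use an anchor in the anchoring position; in $A the dollar sign is just another character to match.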
The use of ^ and $ as indicators of the beginning or end of a line is a convention other utilities use. The vi editor uses these two characters as commands to go to the beginning or end of a line. The C shell uses !^ to specify the first argument of the previous line, and !$ is the last argument on the previous line (article 30.8 explains). It is one of those choices that other utilities go along with to maintain consistency. For instance, $ can refer to the last line of a file when using ed and sed. cat -v -e (12.5, 12.4) marks ends of lines with a $. You might also see it in other programs. BB
32.6 Regular Expressions: Matching a Character with a Character Set

The simplest character set is a single character. The regular expression the contains three character sets: t, h, and e. It will match any line that contains the string the, including the word other. To prevent this, put a space before and after the pattern: " the " (space, the, space).

You can combine the string with an anchor. The pattern ^From: will match the lines of a mail message (1.21) that identify the sender. Use this pattern with grep to print every address in your incoming mailbox. [If your system doesn't define the environment variable MAIL, try /var/spool/mail/$USER (35.5) or possibly /usr/spool/mail/$USER. SJC]
% grep '^From: ' $MAIL
Some characters have a special meaning in regular expressions. If you want to search for such a character as itself, escape it with a backslash (\). BB
32.7 Regular Expressions: Match Any Character with . (Dot)

The dot (.) is one of those special metacharacters. By itself it will match any character except the end-of-line character. The pattern that will match a line consisting of exactly one character, whatever that character is, is ^.$. BB
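
The sketch below contrasts the dot as a metacharacter with an escaped, literal dot. The sample lines (including one empty line) are assumptions chosen for illustration.

```shell
# Note the empty third line: the dot needs a character to match.
lines='a
ab

a.c
abc'

one=$(printf '%s\n' "$lines" | grep -c '^.$')       # lines of exactly one character
anydot=$(printf '%s\n' "$lines" | grep -c 'a.c')    # dot matches "." and "b" alike
realdot=$(printf '%s\n' "$lines" | grep -c 'a\.c')  # backslash makes the dot literal

echo "one=$one anydot=$anydot realdot=$realdot"
```

The empty line is not counted by ^.$, and abc stops matching once the dot is escaped.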
32.8 Regular Expressions: Specifying a Range of Characters with [...]

If you want to match specific characters, you can use square brackets, [], to identify the exact characters you are searching for. The pattern that will match any line of text that consists of exactly one digit is ^[0123456789]$. This is longer than it has to be. You can use the hyphen between two characters to specify a range: ^[0-9]$. You can intermix explicit characters with character ranges. This pattern will match a single character that is a letter, digit, or underscore: [A-Za-z0-9_]. Character sets can be combined by placing them next to one another. If you wanted to search for a word that:

  started with an uppercase T,
  was the first word on a line,
  had a lowercase letter as its second letter,
  was three letters long (followed by a space character), and
  had a lowercase vowel as its third letter,

the regular expression would be:

'^T[a-z][aeiou] '

(The quotes are not part of the pattern; they are there only to make the trailing space visible.)
To be specific: a range is a contiguous series of characters, from low to high, in the ASCII character set.* For example, [z-a] is not a range because it's backwards. The range [A-z] matches both uppercase and lowercase letters, but it also matches the six characters that fall between uppercase and lowercase letters in the ASCII chart: [, \, ], ^, _, and `. BB
* Some languages, notably Java and Perl, do support Unicode regular expressions, but as Unicode generally subsumes the ASCII 7-bit character set, regular expressions written for ASCII will work as well.
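
Here is a hedged sketch of the range patterns from this article. The sample lines are invented, and LC_ALL=C is set on the assumption that you want plain ASCII ordering; in other locales a range may be interpreted differently.

```shell
# Force ASCII collation so that ranges mean what the ASCII chart says.
LC_ALL=C; export LC_ALL

lines='7
42
The end
Too late
Tap water
Tea'

# A line that consists of exactly one digit.
digit=$(printf '%s\n' "$lines" | grep -c '^[0-9]$')

# T, a lowercase letter, a lowercase vowel, then a space, at the start of a line.
word=$(printf '%s\n' "$lines" | grep -c '^T[a-z][aeiou] ')

# [A-z] also covers the punctuation between Z and a, such as the caret.
stray=$(printf '%s\n' '^' | grep -c '[A-z]')

echo "digit=$digit word=$word stray=$stray"
```

"Tap water" fails because p is not a vowel, and "Tea" fails because nothing follows the third letter.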
32.9 Regular Expressions: Exceptions in a Character Set

You can easily search for all characters except those in square brackets by putting a caret (^) as the first character after the left square bracket ([). To match all characters except lowercase vowels, use [^aeiou]. Like the anchors in places that can't be considered an anchor, the right square bracket (]) and dash (-) do not have a special meaning if they directly follow a [. Table 32-2 has some examples.
Table 32-2. Regular expression character set examples

Regular expression   Matches
[0-9]                Any digit
[^0-9]               Any character other than a digit
[-0-9]               Any digit or a -
[0-9-]               Any digit or a -
[^-0-9]              Any character except a digit or a -
[]0-9]               Any digit or a ]
[0-9]]               Any digit followed by a ]
[0-99-z]             Any digit or any character between 9 and z
[]0-9-]              Any digit, a -, or a ]
Many languages have adopted the Perl regular expression syntax for ranges; for example, \w is equivalent to any word character or [A-Za-z0-9_], while \W matches anything but a word character. See the perlre(1) manual page for more details. BB
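
A few of the character sets from Table 32-2 can be tried out as follows. The sample lines are made up, and the ^...*$ wrapper is added here only so that each count means "the whole line is built from this set."

```shell
lines='rhythm
audio
42-
9]
x'

# Lines that contain no lowercase vowel anywhere.
novowel=$(printf '%s\n' "$lines" | grep -c '^[^aeiou]*$')

# A digit followed by a literal ]: the second ] is outside the set.
digbr=$(printf '%s\n' "$lines" | grep -c '[0-9]]')

# Lines made up only of digits, ] and -: ] comes first and - last,
# so neither has its special meaning inside the brackets.
mixed=$(printf '%s\n' "$lines" | grep -c '^[]0-9-]*$')

echo "novowel=$novowel digbr=$digbr mixed=$mixed"
```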
32.10 Regular Expressions: Repeating Character Sets with *

The third part of a regular expression is the modifier. It is used to specify how many times you expect to see the previous character set. The special character * (asterisk) matches zero or more copies. That is, the regular expression 0* matches zero or more zeros, while the expression [0-9]* matches zero or more digits. This explains why the pattern ^#* is useless (32.4), as it matches any number of #s at the beginning of the line, including zero. Therefore, this will match every line, because every line starts with zero or more #s.
At first glance, it might seem that starting the count at zero is stupid. Not so. Looking for an unknown number of characters is very important. Suppose you wanted to look for a digit at the beginning of a line, and there may or may not be spaces before the digit. Just use ^ * (a caret, a space, and an asterisk) to match zero or more spaces at the beginning of the line; the complete pattern is then ^ *[0-9]. If you need to match one or more, just repeat the character set. That is, [0-9]* matches zero or more digits and [0-9][0-9]* matches one or more digits. BB
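
The effect of "zero or more" is easy to see by counting matching lines. This is a sketch with invented sample text; every count comes from the same five lines.

```shell
lines='# comment
plain text
   7 spaces then digit
42 leading digits
x9'

# Zero #s at the start still counts, so every line matches.
hashes=$(printf '%s\n' "$lines" | grep -c '^#*')

# Likewise, zero digits can be found on any line.
zero_or_more=$(printf '%s\n' "$lines" | grep -c '[0-9]*')

# Repeating the set demands at least one digit.
one_or_more=$(printf '%s\n' "$lines" | grep -c '[0-9][0-9]*')

# Optional leading spaces, then a digit.
leading=$(printf '%s\n' "$lines" | grep -c '^ *[0-9]')

echo "hashes=$hashes zero_or_more=$zero_or_more one_or_more=$one_or_more leading=$leading"
```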
32.11 Regular Expressions: Matching a Specific Number of Sets with \{ and \}

You cannot specify a maximum number of sets with the * modifier. However, some programs (32.20) recognize a special pattern you can use to specify the minimum and maximum number of repeats. This is done by putting those two numbers between \{ and \}. Having convinced you that \{ isn't a plot to confuse you, an example is in order. The regular expression to match four, five, six, seven, or eight lowercase letters is:
[a-z]\{4,8\}
Any numbers between 0 and 255 can be used. The second number may be omitted, which removes the upper limit. If the comma and the second number are omitted, the pattern must be duplicated the exact number of times specified by the first number.
The backslashes deserve a special discussion. Normally a backslash turns off the special meaning for a character. For example, a literal period is matched by \. and a literal asterisk is matched by \*. However, if a backslash is placed before a <, >, {, }, (, or ) or before a digit, the backslash turns on a special meaning. This was done because these special functions were added late in the life of regular expressions. Changing the meaning of {, }, (, ), <, and > would have broken old expressions. (This is a horrible crime punishable by a year of hard labor writing COBOL programs.) Instead, adding a backslash added functionality without breaking old programs. Rather than complain about the change, view it as evolution.
You must remember that modifiers like * and \{1,5\} act as modifiers only if they follow a character set. If they were at the beginning of a pattern, they would not be modifiers. Table 32-3 is a list of examples and the exceptions.
Table 32-3. Regular expression pattern repetition examples

Regular expression   Matches
*                    Any line with a *
\*                   Any line with a *
\\                   Any line with a \
^*                   Any line starting with a *
^A*                  Any line
^A\*                 Any line starting with an A*
^AA*                 Any line starting with one A
^AA*B                Any line starting with one or more As followed by a B
^A\{4,8\}B           Any line starting with four, five, six, seven, or eight As followed by a B
^A\{4,\}B            Any line starting with four or more As followed by a B
^A\{4\}B             Any line starting with an AAAAB
\{4,8\}              Any line with a {4,8}
A{4,8}               Any line with an A{4,8}
BB
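
The interval rows of Table 32-3 can be checked with a grep that supports \{ and \} in simple regular expressions (article 32.20 lists which programs do). The sample lines here are assumptions: runs of three, four, eight, and nine As followed by a B, plus one line containing the literal text A{4,8}.

```shell
lines='AAAB
AAAAB
AAAAAAAAB
AAAAAAAAAB
A{4,8}'

range=$(printf '%s\n' "$lines" | grep -c '^A\{4,8\}B')   # four to eight As, then B
open=$(printf '%s\n' "$lines" | grep -c '^A\{4,\}B')     # four or more As, then B
exact=$(printf '%s\n' "$lines" | grep -c '^A\{4\}B')     # exactly AAAAB
plain=$(printf '%s\n' "$lines" | grep -c 'A{4,8}')       # no backslashes: literal text

echo "range=$range open=$open exact=$exact plain=$plain"
```

The nine-A line fails the first pattern because the anchor pins the match to the start of the line, leaving a ninth A where the B should be.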
32.12 Regular Expressions: Matching Words with \< and \>

Searching for a word isn't quite as simple as it at first appears. The string the will match the word other. You can put spaces before and after the letters and use this regular expression: " the " (space, the, space). However, this does not match words at the beginning or the end of the line. And it does not match the case where there is a punctuation mark after the word.

There is an easy solution, at least in many versions of ed, ex, vi, and grep. The characters \< and \> are similar to the ^ and $ anchors, as they don't occupy a position of a character. They anchor the expression between to match only if it is on a word boundary. The pattern to search for the words the and The would be: \<[tT]he\>.

Let's define a "word boundary." The character before the t or T must be either a newline character or anything except a letter, digit, or underscore ( _ ). The character after the e must also be a character other than a digit, letter, or underscore, or it could be the end-of-line character. BB
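
The three approaches can be compared side by side. This sketch assumes a grep that understands \< and \> (GNU grep does; other versions may not), and the sample lines are invented.

```shell
lines='the start
other things
in The middle
bathe
at the.
end with the
see the cat'

# Bare string: also finds "other" and "bathe".
bare=$(printf '%s\n' "$lines" | grep -c 'the')

# Spaces on both sides: misses line ends and punctuation.
spaced=$(printf '%s\n' "$lines" | grep -c ' the ')

# Word anchors: "the" or "The" as a whole word, wherever it sits.
word=$(printf '%s\n' "$lines" | grep -c '\<[tT]he\>')

echo "bare=$bare spaced=$spaced word=$word"
```

Only the last pattern finds the word at the start of a line, at the end of a line, and before a period while still rejecting other and bathe.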
32.13 Regular Expressions: Remembering Patterns with \(, \), and \1

Another pattern that requires a special mechanism is searching for repeated words. The expression [a-z][a-z] will match any two lowercase letters. If you wanted to search for lines that had two adjoining identical letters, the above pattern wouldn't help. You need a way to remember what you found and see if the same pattern occurs again. In some programs, you can mark part of a pattern using \( and \). You can recall the remembered pattern with \ followed by a single digit.* Therefore, to search for two identical letters, use \([a-z]\)\1. You can have nine different remembered patterns. Each occurrence of \( starts a new pattern. The regular expression to match a five-letter palindrome (e.g., "radar") is: \([a-z]\)\([a-z]\)[a-z]\2\1. [Some versions of some programs can't handle \( \) in the same regular expression as \1, etc. In all versions of sed, you're safe if you use \( \) on the pattern side of an s command and \1, etc., on the replacement side (34.11). JP] BB
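
Both remembered-pattern expressions can be tried on a short word list. This is a sketch only: the words are made up, and it assumes a grep that accepts \( \) and \1 in the same pattern (see the bracketed caution above). The -x option makes the palindrome pattern match whole lines.

```shell
words='book
radar
level
apple
stone'

# A lowercase letter immediately followed by the same letter.
doubled=$(printf '%s\n' "$words" | grep -c '\([a-z]\)\1')

# Five-letter palindromes: letters 1 and 2 are recalled in reverse order.
palin=$(printf '%s\n' "$words" | grep -c -x '\([a-z]\)\([a-z]\)[a-z]\2\1')

echo "doubled=$doubled palin=$palin"
```

book and apple contain a doubled letter; radar and level read the same in both directions.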
32.14 Regular Expressions: Potential Problems

Before I discuss the extensions that extended expressions (32.15) offer, I want to mention two potential problem areas.

The \< and \> characters were introduced in the vi editor. The other programs didn't have this ability at that time. Also, the \{min,max\} modifier is new, and earlier utilities didn't have this ability. This makes it difficult for the novice user of regular expressions, because it seems as if each utility has a different convention. Sun has retrofitted the newest regular expression library to all of their programs, so they all have the same ability. If you try to use these newer features on other vendors' machines, you might find they don't work the same way.

The other potential point of confusion is the extent of the pattern matches (32.17). Regular expressions match the longest possible pattern. That is, the regular expression A.*B matches AAB as well as AAAABBBBABCCCCBBBAAAB. This doesn't cause many problems using grep, because an oversight in a regular expression will just match more lines than desired. If you use sed, and your patterns get carried away, you may end up deleting or changing more than you want to. Perl answers this problem by defining a variety of "greedy" and "non-greedy" regular expressions, which allow you to specify which behavior you want. See the perlre(1) manual page for details. BB

* In Perl, you can also use $1 through $9 and even beyond, with the right switches, in addition to the backslash mechanism.
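
The longest-match rule matters most when you substitute. The sketch below uses an invented input line; replacing .* with a negated character set is shown as one common way to rein in a match, not as the only one.

```shell
line='AABxxAAB tail'

# A.*B runs from the first A to the LAST B on the line.
greedy=$(printf '%s\n' "$line" | sed 's/A.*B/X/')
echo "$greedy"

# A[^B]*B cannot cross a B, so it stops at the first one.
limited=$(printf '%s\n' "$line" | sed 's/A[^B]*B/X/')
echo "$limited"
```

With the first command, everything up to the final B is replaced; with the second, only the leading AAB is.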
32.15 Extended Regular Expressions

At least two programs use extended regular expressions: egrep and awk. [perl uses expressions that are even more extended. JP] With these extensions, special characters preceded by a backslash no longer have special meaning: \{, \}, \<, \>, \(, \), as well as \digit. There is a very good reason for this, which I will delay explaining to build up suspense.

The question mark (?) matches zero or one instance of the character set before it, and the plus sign (+) matches one or more copies of the character set. You can't use \{ and \} in extended regular expressions, but if you could, you might consider ? to be the same as \{0,1\} and + to be the same as \{1,\}.

By now, you are wondering why the extended regular expressions are even worth using. Except for two abbreviations, there seem to be no advantages and a lot of disadvantages. Therefore, examples would be useful.

The three important characters in the expanded regular expressions are (, |, and ). Parentheses are used to group expressions; the vertical bar acts as an OR operator. Together, they let you match a choice of patterns. As an example, you can use egrep to print all From: and Subject: lines from your incoming mail [which may also be in /var/spool/mail/$USER. JP]:
% egrep '^(From|Subject): ' /usr/spool/mail/$USER
All lines starting with From: or Subject: will be printed. There is no easy way to do this with simple regular expressions. You could try something like ^[FS][ru][ob][mj]e*c*t*: and hope you don't have any lines that start with Sromeet:.

Extended expressions don't have the \< and \> characters. You can compensate by using the alternation mechanism. Matching the word "the" in the beginning, middle, or end of a sentence or at the end of a line can be done with the extended regular expression (^| )the([^a-z]|$). There are two choices before the word: a space or the beginning of a line. Following the word, there must be something besides a lowercase letter or else the end of the line. One extra bonus with extended regular expressions is the ability to use the *, +, and ? modifiers after a (...) grouping.

[If you're on a Darwin system and use Apple Mail or one of the many other clients, you can grep through your mail files locally. For Mail, look in your home directory's Library/Mail/ directory. There should be a subdirectory there, perhaps named something like iTools:example@mail.example.com, with an IMAP directory tree beneath it. IMAP stores messages individually, not in standard Unix mbox format, so there is no way to look for all matches in a single mailbox by grepping a single file, but fortunately, you can use regular expressions to construct a file list to search. :-) SJC]

Here are two ways to match "a simple problem", "an easy problem", as well as "a problem"; the second expression is more exact:
% egrep "a[n]? (simple|easy)? ?problem" data
% egrep "a[n]? ((simple|easy) )?problem" data
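To see what "more exact" means in practice, it helps to run both expressions over a few test lines. The following is only a sketch: the sample lines are invented for the purpose, the helper name sample is hypothetical, and grep -E is used as the POSIX spelling of egrep.

```shell
#!/bin/sh
# Hypothetical test lines; the last one runs "simple" and "problem" together.
sample() {
    printf '%s\n' 'a problem' 'an easy problem' 'a simple problem' 'a simpleproblem'
}

# First expression: the adjective and the space after it are each optional,
# independently of one another.
loose=$(sample | grep -E -c 'a[n]? (simple|easy)? ?problem')

# Second expression: the adjective and its trailing space are optional as a unit.
exact=$(sample | grep -E -c 'a[n]? ((simple|easy) )?problem')

echo "first expression matched $loose lines, second matched $exact"
```

With these four lines, the first expression also accepts the run-together "a simpleproblem", while the second rejects it.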
I promised to explain why the backslash characters don't work in extended regular expressions. Well, perhaps the \{...\} and \<...\> could be added to the extended expressions, but it might confuse people if those characters are added and the \(...\) are not. And there is no way to add that functionality to the extended expressions without changing the current usage. Do you see why? It's quite simple. If ( has a special meaning, then \( must be the ordinary character. This is the opposite of the simple regular expressions, where ( is ordinary and \( is special. The usage of the parentheses is incompatible, and any change could break old programs.

If the extended expressions used (...|...) as regular characters, and \(...\|...\) for specifying alternate patterns, then it would be possible to have one set of regular expressions that has full functionality. This is exactly what GNU Emacs (19.1) does, by the way: it combines all of the features of regular and extended expressions with one syntax.
BB
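The incompatibility of the parentheses is easy to see at the command line. This is a small sketch under stated assumptions: the two input lines are made up, the helper name lines is hypothetical, and grep -E stands in for egrep.

```shell
#!/bin/sh
# Hypothetical input: one line with literal parentheses, one without.
lines() {
    printf '%s\n' 'call f(x) here' 'call fx here'
}

# Simple (basic) syntax: ( and ) are ordinary characters.
basic_literal=$(lines | grep 'f(x)')

# Extended syntax: ( and ) group, so the same pattern now means "fx".
extended_group=$(lines | grep -E 'f(x)')

# Extended syntax: a backslash makes the parentheses ordinary again.
extended_literal=$(lines | grep -E 'f\(x\)')

printf 'basic f(x):      %s\n' "$basic_literal"
printf 'extended f(x):   %s\n' "$extended_group"
printf 'extended f\\(x\\): %s\n' "$extended_literal"
```

The same seven characters select different lines depending on which syntax the program uses, which is why neither meaning can be changed without breaking existing scripts.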
32.16 Getting Regular Expressions Right

Writing regular expressions involves more than learning the mechanics. You not only have to learn how to describe patterns, but you also have to recognize the context in which they appear. You have to be able to think through the level of detail that is necessary in a regular expression, based on the context in which the pattern will be applied.

The same thing that makes writing regular expressions difficult is what makes writing them interesting: the variety of occurrences or contexts in which a pattern appears. This complexity is inherent in language itself, just as you can't always understand an expression (32.1) by looking up each word in the dictionary.

The process of writing a regular expression involves three steps:

1. Knowing what you want to match and how it might appear in the text.
2. Writing a pattern to describe what you want to match.
3. Testing the pattern to see what it matches.
Part VI: Scripting. This is the Title of the Book, eMatter Edition. Copyright 2006 O'Reilly & Associates, Inc. All rights reserved.
This process is virtually the same kind of process that a programmer follows to develop a program. Step 1 might be considered the specification, which should reflect an understanding of the problem to be solved as well as how to solve it. Step 2 is analogous to the actual coding of the program, and step 3 involves running the program and testing it against the specification. Steps 2 and 3 form a loop that is repeated until the program works satisfactorily.

Testing your description of what you want to match ensures that the description works as expected. It usually uncovers a few surprises. Carefully examining the results of a test, comparing the output against the input, will greatly improve your understanding of regular expressions. You might consider evaluating the results of a pattern-matching operation as follows:

Hits
    The lines that I wanted to match.
Misses
    The lines that I didn't want to match.
Misses that should be hits
    The lines that I didn't match but wanted to match.
Hits that should be misses
    The lines that I matched but didn't want to match.

Trying to perfect your description of a pattern is something that you work at from opposite ends: you try to eliminate the "hits that should be misses" by limiting the possible matches, and you try to capture the "misses that should be hits" by expanding the possible matches.

The difficulty is especially apparent when you must describe patterns using fixed strings. Each character you remove from the fixed-string pattern increases the number of possible matches. For instance, while searching for the string "what", you determine that you'd like to match "What" as well. The only fixed-string pattern that will match "What" and "what" is "hat", the longest string common to both. It is obvious, though, that searching for "hat" will produce unwanted matches. Each character you add to a fixed-string pattern decreases the number of possible matches. The string "them" is going to produce fewer matches than the string "the".
Using metacharacters in patterns provides greater flexibility in extending or narrowing the range of matches. Metacharacters, used in combination with literals or other metacharacters, can be used to expand the range of matches while still eliminating the matches that you do not want. DD
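The what/What example can be tried out directly. The sketch below is only an illustration of the test step: the four sample lines are invented, and the helper name text is hypothetical.

```shell
#!/bin/sh
# Hypothetical sample text: two lines we want, two we do not.
text() {
    printf '%s\n' 'What is this?' 'I know what it is.' 'Where is my hat?' 'That is all.'
}

# Fixed string "hat": every line matches, so two hits should be misses.
fixed=$(text | grep -c 'hat')

# A character set narrows the match to "what" or "What".
narrowed=$(text | grep -c '[Ww]hat')

# Step 3 of the process: list what the pattern leaves out and check it by eye.
missed=$(text | grep -v '[Ww]hat')

printf 'hat: %s lines, [Ww]hat: %s lines\n' "$fixed" "$narrowed"
printf 'not matched:\n%s\n' "$missed"
```

Looking at the lines that were not matched is as important as looking at those that were; grep -v gives the misses for free.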
32.17 Just What Does a Regular Expression Match?

One of the toughest things to learn about regular expressions is just what they do match. The problem is that a regular expression tends to find the longest possible match, which can be more than you want.

Here's a simple script called showmatch that is useful for testing regular expressions, when writing sed scripts, etc. Given a regular expression and a filename, it finds lines in the file matching that expression, just like grep, but it uses a row of carets (^^^^) to highlight the portion of the line that was actually matched. Depending on your system, you may need to call nawk instead of awk; most modern systems have an awk that supports the syntax introduced by nawk, however.
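#! /bin/sh
# showmatch - mark string that matches pattern
pattern=$1; shift
awk 'match($0,pattern) > 0 {
    # s: the text before the match; it becomes the padding under the line
    s = substr($0,1,RSTART-1)
    # m: any RLENGTH characters will do, since each one is turned into a caret
    m = substr($0,1,RLENGTH)
    gsub (/[^\b- ]/, " ", s)
    gsub (/./, "^", m)
    printf "%s\n%s%s\n", $0, s, m
}' pattern="$pattern" $*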
For example:
% showmatch 'CD-...' mbox
and CD-ROM publishing. We have recognized
    ^^^^^^
that documentation will be shipped on CD-ROM; however,
                                      ^^^^^^
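If you only want the numbers behind the carets, the same awk match() call that showmatch relies on can report them. This is a minimal sketch with an invented sample line; it shows how far the "longest possible match" actually reaches.

```shell
#!/bin/sh
# Invented sample line with two quoted words.
line='one "two" three "four" five'

# match() sets RSTART (where the match begins) and RLENGTH (how long it is).
span=$(printf '%s\n' "$line" |
    awk '{ if (match($0, /".*"/)) print RSTART, RLENGTH, substr($0, RSTART, RLENGTH) }')

echo "$span"
```

The match begins at the first quotation mark but runs all the way to the last one on the line, not to the nearest closing quote.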
xgrep is a related script that simply retrieves only the matched text. This allows you to extract patterned data from a file. For example, you could extract only the numbers from a table containing both text and numbers. It's also great for counting the number of occurrences of some pattern in your file, as shown below. Just be sure that your expression matches only what you want. If you aren't sure, leave off the wc command and glance at the output.

For example, the regular expression [0-9]* will match numbers like 3.2 twice: once for the 3 and again for the 2! You want to include a dot (.) and/or comma (,), depending on how your numbers are written. For example: [0-9][.0-9]* matches a leading digit, possibly followed by more dots and digits.
Remember that an expression like [0-9]* will match zero numbers (because * means zero or more of the preceding character). That expression can make xgrep run for a very long time! The following expression, which matches one or more digits, is probably what you want instead:
xgrep "[0-9][0-9]*" files | wc -l
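If xgrep isn't at hand, the counting idea can still be tried with the -o option of GNU and BSD grep, which prints only the matched text. Note that this swaps in a different tool: -o is not part of POSIX grep and is not how xgrep itself works. The sample row is invented.

```shell
#!/bin/sh
# Hypothetical table row mixing text and numbers.
row='widgets 3.2 kg, 14 boxes'

# Digits only: "3.2" is counted as two separate numbers.
digits_only=$(printf '%s\n' "$row" | grep -o '[0-9][0-9]*' | wc -l)

# A leading digit, then dots and digits: "3.2" is counted once.
with_dot=$(printf '%s\n' "$row" | grep -o '[0-9][.0-9]*' | wc -l)

echo "digits only: $digits_only matches; with dot: $with_dot matches"
```

As the text advises, dropping the wc command and glancing at the raw output is the quickest way to see which of the two counts is the one you meant.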
The xgrep shell script runs the sed commands below, replacing $re with the regular expression from the command line and $x with a CTRL-b character (which is used as a delimiter). We've shown the sed commands numbered, like 5>; these are only for reference and aren't part of the script:
1>  \$x$re$x!d
2>  s//$x&$x/g
3>  s/[^$x]*$x//
4>  s/$x[^$x]*$x/\
    /g
5>  s/$x.*//
Command 1 deletes all input lines that don't contain a match. On the remaining lines (which do match), command 2 surrounds the matching text with CTRL-b delimiter characters. Command 3 removes all characters (including the first delimiter) before the first match on a line. When there's more than one match on a line, command 4 breaks the multiple matches onto separate lines. Command 5 removes the last delimiter, and any text after it, from every output line.

Greg Ubben revised showmatch and wrote xgrep.
JP, DD, and TOR
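The five commands can be assembled into a shell function to watch them work. What follows is a hypothetical reconstruction for illustration, not the distributed xgrep script: the function name xgrep_sketch and the way the arguments are handled are assumptions, and only the sed commands come from the listing above.

```shell
#!/bin/sh
# xgrep_sketch RE [FILE...] - print each piece of text matching RE on its own line.
# Hypothetical wrapper around the five sed commands described in the text.
xgrep_sketch() {
    re=$1; shift
    x=$(printf '\002')                 # CTRL-b, used as the delimiter
    sed -e "\\$x$re$x"'!d' \
        -e "s//$x&$x/g" \
        -e "s/[^$x]*$x//" \
        -e "s/$x[^$x]*$x/\\
/g" \
        -e "s/$x.*//" "$@"
}

# Invented input: one line with two numbers, one line with none.
numbers=$(printf '%s\n' 'a 12 b 345 c' 'no digits here' | xgrep_sketch '[0-9][0-9]*')
echo "$numbers"
```

The empty pattern in the second command reuses the regular expression from the first, so the expression only has to be written once; that also means a slash in the expression causes no trouble, because CTRL-b is the delimiter there.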
32.18 Limiting the Extent of a Match

A regular expression tries to match the longest string possible, which can cause unexpected problems. For instance, look at the following regular expression, which matches any number of characters inside quotation marks:
".*"
Let's imagine an HTML table with lots of entries, each of which has two quoted strings, as shown below:
<td><a href="#arts"><img src="d_arrow.gif" border=0></a>
All the text in each line of the table is the same, except the text inside the quotes. To match the line through the first quoted string, a novice might describe the pattern with the following regular expression:
<td><a href=".*">
However, the pattern ends up matching almost all of the entry because the second quotation mark in the pattern matches the last quotation mark on the line! If you know how many quoted strings there are, you can specify each of them:
<td><a href=".*"><img src=".*" border=0></a>
Although this works as you'd expect, some line in the file might not have the same number of quoted strings, causing misses that should be hits; you simply want the first argument. Here's a different regular expression that matches the shortest possible extent between two quotation marks:
"[^"]*"
It matches a quote, followed by any number of characters that do not match a quote, followed by a quote. Note, however, that it will be fooled by escaped quotes, in strings such as the following:
$strExample = "This sentence contains an escaped \" character.";
The use of what we might call negated character classes like this is one of the things that distinguishes the journeyman regular expression user from the novice. DD and JP
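The difference between the two patterns shows up clearly if each is used in a sed substitution on the table entry from above. This is just a sketch; the [QUOTED] marker is an arbitrary replacement chosen to make the matched extent visible.

```shell
#!/bin/sh
# The table entry used as the example in this article.
entry='<td><a href="#arts"><img src="d_arrow.gif" border=0></a>'

# ".*" runs from the first quotation mark to the last one on the line.
greedy=$(printf '%s\n' "$entry" | sed 's/".*"/[QUOTED]/')

# "[^"]*" stops at the first closing quotation mark.
shortest=$(printf '%s\n' "$entry" | sed 's/"[^"]*"/[QUOTED]/')

printf '%s\n%s\n' "$greedy" "$shortest"
```

In the first result the img tag has been swallowed along with both quoted strings; in the second, only "#arts" has been replaced.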
32.19 I Never Meta Character I Didn't Like

Once you know regular expression syntax, you can match almost anything. But sometimes, it's a pain to think through how to get what you want. Table 32-4 lists some useful regular expressions that match various kinds of data you might have to deal with in the Unix environment. Some of these examples work in any program that uses regular expressions; others only work with a specific program such as egrep. (Article 32.20 lists the metacharacters that each program accepts.) A space shown inside an expression is a literal space and is part of the regular expression. Bear in mind that you may also be able to use \< and \> to match on word boundaries.

Note that these regular expressions are only examples. They aren't meant to match (for instance) every occurrence of a city and state in any arbitrary text. But if you can picture what the expression does and why, that should help you write an expression that fits your text.
Table 32-4. Some useful regular expressions

Item                             Example                            Regular expression
U.S. state abbreviation          (NM)                               [A-Z][A-Z]
U.S. city, state                 (Portland, OR)                     ^.*, [A-Z][A-Z]
Month day, year                  (JAN 05, 1993); (January 5, 1993)  [A-Z][A-Za-z]\{2,8\} [0-9]\{1,2\}, [0-9]\{4\}
U.S. Social Security number      (123-45-6789)                      [0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}
U.S. telephone number            (547-5800)                         [0-9]\{3\}-[0-9]\{4\}
Unformatted dollar amounts       ($1); ($ 1000000.00)               \$ *[0-9]+(\.[0-9][0-9])?
HTML/SGML/XML tags               (<h2>); (<UL COMPACT>)             <[^>]*>
troff macro with first argument  (.SH "SEE ALSO")                   ^\.[A-Z12]. "[^"]*"
troff macro with all arguments   (.Ah "Tips for" "ex & vi")         ^\.[A-Z12]. ".*"
Blank lines                                                         ^$
Entire line                                                         ^.*$
One or more spaces                                                  two spaces followed by *
DD and JP
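A few of the expressions in Table 32-4 can be checked against sample data in a couple of lines. This is a sketch only: the sample lines are invented, the helper name data is hypothetical, and grep -E is used where the table calls for an egrep-style expression.

```shell
#!/bin/sh
# Hypothetical sample lines, one per kind of data.
data() {
    printf '%s\n' 'SSN 123-45-6789 on file' 'call 547-5800 today' \
        'total: $ 1000000.00' 'see <UL COMPACT> here'
}

# Simple syntax with \{n\} intervals (grep).
ssn=$(data | grep -c '[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}')
phone=$(data | grep -c '[0-9]\{3\}-[0-9]\{4\}')

# Extended syntax with + and ( )? (egrep, spelled grep -E here).
dollars=$(data | grep -E -c '\$ *[0-9]+(\.[0-9][0-9])?')

# No special syntax needed; works with either program.
tags=$(data | grep -c '<[^>]*>')

echo "ssn=$ssn phone=$phone dollars=$dollars tags=$tags"
```

Each expression picks out exactly one of the four lines here, but real text is rarely this tidy, which is why the expressions are offered as starting points rather than finished patterns.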
32.20 Valid Metacharacters for Different Unix Programs

Some regular expression metacharacters are valid for one program but not for another. Those that are available to a particular Unix program are marked by a bullet (•) in Table 32-5. Quick reference descriptions of each of the characters can be found in article 32.21.

[Unfortunately, even this table doesn't give the whole story. For example, Sun has taken some of the extensions originally developed for ed, ex, and vi (such as the \< \> and \{min, max\} modifiers) and added them to other programs that use regular expressions. So don't be bashful: try things out, but don't be surprised if every possible regular expression feature isn't supported by every program. In addition, there are many programs that recognize regular expressions, such as perl, emacs, more, dbx, expr, lex, pg, and less, that aren't covered in Daniel's table. TOR]
Table 32-5. Valid metacharacters for different programs
(The original table has one column per program: ed, ex, vi, sed, awk, grep, and egrep. The per-program marks are not reproduced here.)

Symbol   Action
.        Match any character.
*        Match zero or more preceding.
^        Match beginning of line.
$        Match end of line.
\        Escape character following.
[ ]      Match one from a set.
\( \)    Store pattern for later replay.
\{ \}    Match a range of instances.
\< \>    Match word's beginning or end.
+        Match one or more preceding.
?        Match zero or one preceding.
|        Separate choices to match.
( )      Group expressions to match.
In ed, ex, and sed, note that you specify both a search pattern (on the left) and a replacement pattern (on the right). The metacharacters in Table 32-5 are meaningful only in a search pattern. ed, ex, and sed support the additional metacharacters in Table 32-6 that are valid only in a replacement pattern.
Table 32-6. Valid metacharacters for replacement patterns
(The original table has one column per program: ex, sed, and ed. The per-program marks are not reproduced here.)

Symbol   Action
\        Escape character following.
\n       Reuse pattern stored by \( \) pair number \n.
&        Reuse previous search pattern.
~        Reuse previous replacement pattern.
\u \U    Change character(s) to uppercase.
\l \L    Change character(s) to lowercase.
\E       Turn off previous \U or \L.
\e       Turn off previous \u or \l.
DG
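The split between search side and replacement side can be seen with a few sed substitutions. This is a minimal sketch with invented input; it uses only &, \n, and the backslash escape, which sed is known to accept in a replacement.

```shell
#!/bin/sh
# & in the replacement stands for the whole text matched by the search pattern.
wrapped=$(printf '%s\n' 'chapter 32' | sed 's/[0-9][0-9]*/(&)/')

# \1 and \2 replay what the first and second \( \) pairs matched.
swapped=$(printf '%s\n' 'Smith, Jane' | sed 's/\([A-Za-z]*\), \([A-Za-z]*\)/\2 \1/')

# A backslash turns off the special meaning of & in the replacement.
literal=$(printf '%s\n' 'R and D' | sed 's/ and / \& /')

printf '%s\n' "$wrapped" "$swapped" "$literal"
```

Note that the parentheses in the first replacement are plain text: grouping characters mean nothing on the right-hand side.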
32.21 Pattern Matching Quick Reference with Examples

Article 32.4 gives an introduction to regular expressions. This article is intended for those of you who need just a quick listing of regular expression syntax as a refresher from time to time. It also includes some simple examples. The characters in Table 32-7 have special meaning only in search patterns.
Table 32-7. Special characters in search patterns

Pattern    What does it match?
.          Match any single character except newline.
*          Match any number (including none) of the single characters that
           immediately precede it. The preceding character can also be a
           regular expression. For example, since . (dot) means any
           character, .* means "match any number of any character."
^          Match the following regular expression at the beginning of the line.
$          Match the preceding regular expression at the end of the line.
[ ]        Match any one of the enclosed characters. A hyphen (-) indicates a
           range of consecutive characters. A caret (^) as the first
           character in the brackets reverses the sense: it matches any one
           character not in the list. A hyphen or a right square bracket (])
           as the first character is treated as a member of the list. All
           other metacharacters are treated as members of the list.
\{n,m\}    Match a range of occurrences of the single character that
           immediately precedes it. The preceding character can also be a
           regular expression. \{n\} will match exactly n occurrences, \{n,\}
           will match at least n occurrences, and \{n,m\} will match any
           number of occurrences between n and m.
\          Turn off the special meaning of the character that follows (except
           for \{ and \(, etc., where it turns on the special meaning of the
           character that follows).
\( \)      Save the pattern enclosed between \( and \) into a special holding
           space. Up to nine patterns can be saved on a single line. They can
           be replayed in substitutions by the escape sequences \1 to \9.
\< \>      Match characters at beginning (\<) or end (\>) of a word.
+          Match one or more instances of preceding regular expression.
?          Match zero or one instances of preceding regular expression.
|          Match the regular expression specified before or after.
( )        Apply a match to the enclosed group of regular expressions.
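Two of the less familiar entries, \{n\} and the \( \) holding space, are easy to try with grep. The word list below is invented and the helper name words is hypothetical; treat this as a sketch rather than a reference.

```shell
#!/bin/sh
# Hypothetical word list.
words() {
    printf '%s\n' 'book' 'bok' 'aardvark' 'zoo' 'cat'
}

# \([a-z]\)\1 : a lowercase letter, saved, followed by the same letter again.
doubled=$(words | grep -c '\([a-z]\)\1')

# o\{2\} : two of the preceding character in a row.
two_o=$(words | grep -c 'o\{2\}')

# ^.\{3\}$ : lines of exactly three characters.
three=$(words | grep -c '^.\{3\}$')

echo "doubled=$doubled two_o=$two_o three=$three"
```

The saved pattern replays the text that was matched, not the expression, so \([a-z]\)\1 finds "oo" and "aa" but not two different letters side by side.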
The characters in Table 32-8 have special meaning only in replacement patterns.
Table 32-8. Special characters in replacement patterns

Pattern    What does it do?
\          Turn off the special meaning of the character that follows.
\n         Restore the nth pattern previously saved by \( and \). n is a
           number from 1 to 9, with 1 starting on the left.
&          Reuse the string that matched the search pattern as part of the
           replacement pattern.
\u         Convert first character of replacement pattern to uppercase.
\U         Convert replacement pattern to uppercase.
\l         Convert first character of replacement pattern to lowercase.
\L         Convert replacement pattern to lowercase.
Note that many programs, especially perl, awk, and sed, implement their own programming languages and often have much more extensive support for regular expressions. As such, their manual pages are the best place to look when you wish to confirm which expressions are supported or whether the program supports more than simple regular expressions. On many systems, notably those with a large complement of GNU tools, the regular expression support is astonishing, and many generations of tools may be implemented by one program (as with grep, which also emulates the later egrep in the same program, with widely varying support for expression formats based on how the program is invoked). Don't make the mistake of thinking that all of these patterns will work everywhere in every program with regex support, or of thinking that this is all there is.
Examples of Searching
When used with grep or egrep, regular expressions are surrounded by quotes. (If the pattern contains a $, you must use single quotes from the shell; e.g., 'pattern'.) When used with ed, ex, sed, and awk, regular expressions are usually surrounded by / (although any delimiter works). Table 32-9 has some example patterns.
Table 32-9. Search pattern examples

Pattern                           What does it match?
bag                               The string bag.
^bag                              bag at beginning of line.
bag$                              bag at end of line.
^bag$                             bag as the only word on line.
[Bb]ag                            Bag or bag.
b[aeiou]g                         Second letter is a vowel.
b[^aeiou]g                        Second letter is a consonant (or uppercase or symbol).
b.g                               Second letter is any character.
^...$                             Any line containing exactly three characters.
^\.                               Any line that begins with a . (dot).
^\.[a-z][a-z]                     Same, followed by two lowercase letters (e.g., troff requests).
^\.[a-z]\{2\}                     Same as previous, grep or sed only.
^[^.]                             Any line that doesn't begin with a . (dot).
bugs*                             bug, bugs, bugss, etc.
"word"                            A word in quotes.
"*word"*                          A word, with or without quotes.
[A-Z][A-Z]*                       One or more uppercase letters.
[A-Z]+                            Same, extended regular expression format.
[A-Z].*                           An uppercase letter, followed by zero or more characters.
[A-Z]*                            Zero or more uppercase letters.
[a-zA-Z]                          Any letter.
[^0-9A-Za-z]                      Any symbol (not a letter or a number).
[567]                             One of the numbers 5, 6, or 7.

Extended regular expression patterns
five|six|seven                    One of the words five, six, or seven.
80[23]?86                         One of the numbers 8086, 80286, or 80386.
compan(y|ies)                     One of the words company or companies.
\<the                             Words like theater or the.
the\>                             Words like breathe or the.
\<the\>                           The word the.
0\{5,\}                           Five or more zeros in a row.
[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}  U.S. Social Security number (nnn-nn-nnnn).
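Several of the patterns in Table 32-9 can be tried against a short word list. This is a sketch with invented test lines and a hypothetical helper name (lines); grep -E stands in for egrep on the extended patterns.

```shell
#!/bin/sh
# Hypothetical test lines.
lines() {
    printf '%s\n' 'bug' 'bugss' 'big' '8086' '80286' '80486' 'company' 'companies'
}

star=$(lines | grep -c 'bugs*')                # bug, bugss
vowel=$(lines | grep -c 'b[aeiou]g')           # bug, bugss, big
chips=$(lines | grep -E -c '80[23]?86')        # 8086, 80286
firms=$(lines | grep -E -c 'compan(y|ies)')    # company, companies

echo "star=$star vowel=$vowel chips=$chips firms=$firms"
```

Because none of these patterns is anchored, each one matches anywhere on the line; that is why b[aeiou]g also counts "bugss".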
Examples of Searching and Replacing

Table 32-10 shows the metacharacters available to sed or ex. (ex commands begin with a colon.) A space shown inside a command is a literal space; a TAB is marked by tab.
Table 32-10. Search and replace commands

Command                               Result
s/.*/( & )/                           Redo the entire line, but add parentheses.
s/.*/mv & &.old/                      Change a word list into mv commands.
/^$/d                                 Delete blank lines.
:g/^$/d                               ex version of previous.
/^[ tab]*$/d                          Delete blank lines, plus lines containing only spaces or TABs.
:g/^[ tab]*$/d                        ex version of previous.
s/  */ /g                             Turn one or more spaces into one space.
:%s/  */ /g                           ex version of previous.
:s/[0-9]/Item &:/                     Turn a number into an item label (on the current line).
:s                                    Repeat the substitution on the first occurrence.
:&                                    Same.
:sg                                   Same, but for all occurrences on the line.
:&g                                   Same.
:%&g                                  Repeat the substitution globally.
:.,$s/Fortran/\U&/g                   Change word to uppercase, on current line to last line.
:%s/.*/\L&/                           Lowercase entire file.
:s/\<./\u&/g                          Uppercase first letter of each word on current line (useful for titles).
:%s/yes/No/g                          Globally change a word to No.
:%s/Yes/~/g                           Globally change a different word to No (previous replacement).
s/die or do/do or die/                Transpose words.
s/\([Dd]ie\) or \([Dd]o\)/\2 or \1/   Transpose, using hold buffers to preserve case.
DG
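The sed versions of these commands are easy to test from the shell. The following is only a sketch with invented input; it sticks to the commands from the table that do not begin with a colon.

```shell
#!/bin/sh
# Delete blank lines, then change the remaining word list into mv commands.
renames=$(printf '%s\n' 'notes' '' 'todo' | sed -e '/^$/d' -e 's/.*/mv & &.old/')

# Turn one or more spaces into one space (two spaces before the *).
squeezed=$(printf '%s\n' 'a   b  c' | sed 's/  */ /g')

# Transpose two words, using \( \) and \2 \1 to preserve their case.
motto=$(printf '%s\n' 'Die or do' | sed 's/\([Dd]ie\) or \([Dd]o\)/\2 or \1/')

printf '%s\n' "$renames" "$squeezed" "$motto"
```

The pattern for squeezing spaces needs two spaces before the asterisk; with only one, it would also match the empty string between every pair of characters.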