Chapter 32. Regular Expressions (Pattern Matching). Unix Power Tools, Third Edition, ISBN: 0-596-00330-7
Prepared for echipbk@gmail.com, Son Nguyen Copyright 2002 O'Reilly Media, Inc.. This download file is made available for personal use only and is subject to the Terms of Service. Any other use requires prior written consent from the copyright owner. Unauthorized use, reproduction and/or distribution are strictly prohibited and violate applicable laws. All rights reserved.
Chapter 32
Regular Expressions (Pattern Matching)
When my young daughter is struggling to understand the meaning of an idiomatic expression, such as "Someone let the cat out of the bag," before I tell her what it means, I have to tell her that it's an expression, that she's not to interpret it literally. (As a consequence, she also uses "That's just an expression" to qualify her own remarks, especially when she is unsure about what she has just said.) An expression, even in computer terminology, is not something to be interpreted literally. It is something that needs to be evaluated.

Many Unix programs use a special "regular expression syntax" for specifying what you could think of as "wildcard searches" through files. Regular expressions describe patterns, or sequences of characters, without necessarily specifying the characters literally. You'll also hear this process referred to as "pattern matching."

In this chapter, we depart a bit from the usual "tips and tricks" style of the book to provide an extended tutorial about regular expressions that starts in article 32.4. We did this because regular expressions are so important to many of the tips and tricks elsewhere in the book, and we wanted to make sure that they are covered thoroughly. This tutorial article is accompanied by a few snippets of advice (articles 32.16 and 32.18) and a few tools that help you see what your expressions are matching (article 32.17). There's also a quick reference (article 32.21) for those of you who just need a refresher.

For tips, tricks, and tools that rely on an understanding of regular expression syntax, you have only to look at:
Chapter 13, Searching Through Files
Chapter 17, vi Tips and Tricks
Chapter 20, Batch Editing
Chapter 34, The sed Stream Editor
Chapter 41, Perl

O'Reilly's Mastering Regular Expressions, by Jeffrey Friedl, is a gold mine of examples and specifics.
DD and TOR
and so grep would then try to find the pattern Array.c in files Bug.c, Comp.c, chap1, and chap2. The simplest solution in most cases is to surround the regular expression with single quotes ('). Another is to use the echo command to echo your command line to see how the shell will interpret the special characters. BB and DG, TOR
* Recent versions of many programs, including find, now support regex via special command-line options. For example, find on my Linux server supports the -regex and -iregex options, for specifying filenames via a regular expression, case-sensitive and -insensitive, respectively. But the find command on my OS X laptop does not. SJC
"Two plus four" consists of several constants or literal values and an operator. A calculator program must recognize, for instance, that 2 is a numeric constant and that the plus sign represents an operator, not to be interpreted as the + character.

An expression tells the computer how to produce a result. Although it is the sum of "two plus four" that we really want, we don't simply tell the computer to return a six. We instruct the computer to evaluate the expression and return a value. An expression can be more complicated than 2+4; in fact, it might consist of multiple simple expressions, such as the following:
2 + 3 * 4
A calculator normally evaluates an expression from left to right. However, certain operators have precedence over others: that is, they will be performed first. Thus, the above expression evaluates to 14 and not 20 because multiplication takes precedence over addition. Precedence can be overridden by placing the simple expression in parentheses. Thus, (2+3)*4 or "the sum of two plus three times four" evaluates to 20. The parentheses are symbols that instruct the calculator to change the order in which the expression is evaluated.

A regular expression, by contrast, is descriptive of a pattern or sequence of characters. Concatenation is the basic operation implied in every regular expression. That is, a pattern matches adjacent characters. Look at the following example of a regular expression:
ABE
Each literal character is a regular expression that matches only that single character. This expression describes "an A followed by a B followed by an E" or simply the string ABE. The term "string" means each character concatenated to the one preceding it. That a regular expression describes a sequence of characters can't be emphasized enough. (Novice users are inclined to think in higher-level units such as words, and not individual characters.) Regular expressions are case-sensitive; A does not match a.

Programs such as grep (13.2) that accept regular expressions must first evaluate the syntax of the regular expression to produce a pattern. They then read the input, line by line, trying to match the pattern. An input line is a string, and to see if a string matches the pattern, a program compares the first character in the string to the first character of the pattern. If there is a match, it compares the
second character in the string to the second character of the pattern. Whenever it fails to make a match, it compares the next character in the string to the first character of the pattern. Figure 32-1 illustrates this process, trying to match the pattern abe on an input line.
Figure 32-1 panels (each compares a string of characters from the input line with the string abe, the pattern):

The / abe: In this example there is no match between the first character of the input line and the first character of the pattern. Since it failed to match, the next character of the input line is compared to the first character of the pattern.

canister / abe: The first match between a string character on the input line and the first character of the pattern occurs in the word canister. Since there is a match, the second character in the pattern is compared to the next character in the input line.

canister / abe: The second character in the pattern does not match the next character in the input line. So, returning to the first character in the pattern, the comparison is made to the next character in the input line. There is no match, so the process starts over.

labeled / abe: The next match of the first character of the pattern occurs in the word labeled.

labeled / abe: Since there is a match, the second character in the pattern is compared to the next character in the input line. In this case there is a match.

labeled / abe: Now the third character in the pattern is compared to the next character in the input line. This is also a match. So, the input line matches the pattern.
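The character-by-character procedure that the figure walks through can be sketched in a few lines of Python. This is only an illustration of the idea for a pattern made of literal characters; it is not how grep is actually implemented, and the function name naive_find and the sample input line are made up for the example.

```python
def naive_find(pattern, line):
    """Return the index where the literal pattern first occurs in line, or -1."""
    for start in range(len(line) - len(pattern) + 1):
        matched = 0
        # Compare pattern characters to input characters one at a time.
        while matched < len(pattern) and line[start + matched] == pattern[matched]:
            matched += 1
        if matched == len(pattern):
            return start  # every pattern character matched
        # Otherwise fall through: retry with the next input character
        # against the first character of the pattern.
    return -1


# Hypothetical input line containing the words used in the figure.
line = "The canister must be labeled"
position = naive_find("abe", line)
print(position, line[position:position + 3])
```

The search passes over the a in canister (the next character is n, not b) and succeeds only inside labeled.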
A regular expression is not limited to literal characters. There is, for instance, a metacharacter, the dot (.), that can be used as a "wildcard" to match any single character. You can think of this wildcard as analogous to a blank tile in Scrabble where it means any letter. Thus, we can specify the regular expression A.E, and it will match ACE, ABE, and ALE. It matches any character in the position following A.

The metacharacter * (the asterisk) is used to match zero or more occurrences of the preceding regular expression, which typically is a single character. You may be familiar with * as a shell metacharacter, where it also means "zero or more characters." But that meaning is very different from * in a regular expression. By itself, the metacharacter * does not match anything in a regular expression; it modifies what goes before it. The regular expression .* matches any number of characters. The regular expression A.*E matches any string that matches A.E but it also matches any number of characters between A and E: AIRPLANE, A FINE, AE, A 34-cent S.A.S.E, or A LONG WAY HOME, for example.
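A short Python sketch may help make the difference concrete. It uses Python's re module as a stand-in for grep (an assumption of this example; the dot and asterisk behave the same way in both) and the standard fnmatch module as a stand-in for shell wildcard matching. The candidate strings are taken from the examples above.

```python
import fnmatch
import re

candidates = ["ACE", "ABE", "ALE", "AE", "A FINE", "A LONG WAY HOME"]

# A.E: an A, exactly one character of any kind, then an E.
dot_hits = [s for s in candidates if re.search(r"A.E", s)]

# A.*E: an A, any number of characters (including none), then an E.
star_hits = [s for s in candidates if re.search(r"A.*E", s)]

# In a regular expression, * repeats the item before it (here, the B) ...
regex_ae = re.fullmatch(r"AB*E", "AE") is not None      # zero B's is fine
regex_abxe = re.fullmatch(r"AB*E", "ABXE") is not None  # X is not a B

# ... while in a shell-style wildcard, * itself stands for any characters.
glob_ae = fnmatch.fnmatchcase("AE", "AB*E")      # the literal B is missing
glob_abxe = fnmatch.fnmatchcase("ABXE", "AB*E")  # * covers the X

print(dot_hits)
print(star_hits)
print(regex_ae, regex_abxe, glob_ae, glob_abxe)
```

The same four characters, AB*E, accept AE as a regular expression but reject it as a wildcard, and the reverse holds for ABXE.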
If you understand the difference between . and * in regular expressions, you already know about the two basic types of metacharacters: those that can be evaluated to a single character, and those that modify how the characters that precede them are evaluated. It should also be apparent that by use of metacharacters you can expand or limit the possible matches. You have more control over what is matched and what is not. In articles 32.4 and after, Bruce Barnett explains in detail how to use regular expression metacharacters. DD
The caret (^) is an anchor that indicates the beginning of the line. The hash mark is a simple character set that matches the single character #. The asterisk (*) is a modifier. In a regular expression, it specifies that the previous character set can appear any number of times, including zero. As you will see shortly, this is a useless regular expression (except for demonstrating the syntax!).

There are two main types of regular expressions: simple (also known as basic) regular expressions and extended regular expressions. (As we'll see in the next dozen articles, the boundaries between the two types have become blurred as regular expressions have evolved.) A few utilities like awk and egrep use the extended regular expression. Most use the simple regular expression. From now on, if I talk about a "regular expression" (without specifying simple or extended), I am describing a feature common to both types. For the most part, though, when using modern tools, you'll find that extended regular expressions are the rule rather than the exception; it all depends on who wrote the version of the tool you're using and when, and whether it made sense to worry about supporting extended regular expressions.
[The situation is complicated by the fact that simple regular expressions have evolved over time, so there are versions of simple regular expressions that support extensions missing from extended regular expressions! Bruce explains the incompatibility at the end of article 32.15. TOR]

The next eleven articles cover metacharacters and regular expressions:
The anchor characters ^ and $ (article 32.5)
Matching a character with a character set (article 32.6)
Match any character with . (dot) (article 32.7)
Specifying a range of characters with [...] (article 32.8)
Exceptions in a character set (article 32.9)
Repeating character sets with * (article 32.10)
Matching a specific number of sets with \{ and \} (article 32.11)
Matching words with \< and \> (article 32.12)
Remembering patterns with \(, \), and \1 (article 32.13)
Potential problems (article 32.14)
Extended regular expressions (article 32.15)
BB
Pattern   Matches
^A        An A at the beginning of a line
A$        An A at the end of a line
A         An A anywhere on a line
$A        A $A anywhere on a line
^\^       A ^ at the beginning of a line
^^        Same as ^\^
\$$       A $ at the end of a line
$$        Same as \$$ (a)

a. Beware! If your regular expression isn't properly quoted, this means "process ID of current process." Always quote your expressions properly.
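The anchor behavior can be tried out with a short Python sketch. One assumption to keep in mind: Python's re module treats an unescaped ^ or $ as an anchor wherever it appears, so the literal cases below are written with a backslash rather than relying on position as grep does. The sample strings are invented for the example.

```python
import re

# (pattern, sample line, whether a match is expected)
cases = [
    (r"^A",  "Apple",   True),   # an A at the beginning of a line
    (r"^A",  "an A",    False),  # the A is not at the beginning
    (r"A$",  "an A",    True),   # an A at the end of a line
    (r"A",   "xAx",     True),   # an A anywhere on a line
    (r"\$A", "cost $A", True),   # a literal $ followed by A
    (r"^\^", "^caret",  True),   # a literal ^ at the beginning of a line
    (r"\$$", "price $", True),   # a literal $ at the end of a line
]

outcomes = [bool(re.search(pattern, text)) for pattern, text, _ in cases]
for (pattern, text, expected), found in zip(cases, outcomes):
    print(f"{pattern!r:8} {text!r:12} expected={expected} found={found}")
```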
The use of ^ and $ as indicators of the beginning or end of a line is a convention other utilities use. The vi editor uses these two characters as commands to go to the beginning or end of a line. The C shell uses !^ to specify the first argument of the previous line, and !$ is the last argument on the previous line (article 30.8 explains). It is one of those choices that other utilities go along with to maintain consistency. For instance, $ can refer to the last line of a file when using ed and sed. cat -v -e (12.5, 12.4) marks ends of lines with a $. You might also see it in other programs. BB
You can combine the string with an anchor. The pattern ^From: will match the lines of a mail message (1.21) that identify the sender. Use this pattern with grep to print every address in your incoming mailbox. [If your system doesn't define the environment variable MAIL, try /var/spool/mail/$USER (35.5) or possibly /usr/spool/mail/$USER. SJC]
Some characters have a special meaning in regular expressions. If you want to search for such a character as itself, escape it with a backslash (\). BB
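As a rough illustration, the following Python sketch filters a small in-memory "mailbox" the way grep would filter a file with the anchored pattern, and then shows the effect of escaping a special character. The mailbox lines and addresses are invented for the example; re.escape is a standard-library helper that adds the backslashes for you.

```python
import re

# Hypothetical mailbox contents.
mailbox = [
    "From: alice@example.com",
    "Subject: lunch",
    "I heard From: is a header name.",
    "From: bob@example.com",
]

# Anchored: only lines that begin with the string "From: " are kept.
senders = [line for line in mailbox if re.search(r"^From: ", line)]

# An unescaped dot matches any character; an escaped dot matches only a dot.
dot_as_wildcard = bool(re.search(r"a.c", "abc"))
dot_escaped_vs_abc = bool(re.search(r"a\.c", "abc"))
dot_escaped_vs_dot = bool(re.search(r"a\.c", "a.c"))

# re.escape produces the backslashed form of a literal string.
escaped = re.escape("a.c")

print(senders)
print(dot_as_wildcard, dot_escaped_vs_abc, dot_escaped_vs_dot, escaped)
```

Without the anchor, the third line would also be reported, because it contains From: in the middle.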
To be specific: a range is a contiguous series of characters, from low to high, in the ASCII character set.* For example, [z-a] is not a range because it's backwards. The range [A-z] matches both uppercase and lowercase letters, but it also matches the six characters that fall between uppercase and lowercase letters in the ASCII chart: [, \, ], ^, _, and `. BB
* Some languages, notably Java and Perl, do support Unicode regular expressions, but as Unicode generally subsumes the ASCII 7-bit character set, regular expressions written for ASCII will work as well.
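The six extra characters are easy to confirm with a small Python sketch that tests every 7-bit ASCII character against the range. Python's re module is used here as a convenient stand-in; how a backwards range is reported differs from tool to tool, and Python happens to reject it with an error.

```python
import re

ascii_chars = [chr(code) for code in range(128)]

# Characters accepted by [A-z] that are not letters.
extras = [c for c in ascii_chars if re.fullmatch(r"[A-z]", c) and not c.isalpha()]

# Spelling out both ranges matches letters only.
letters_only = [c for c in ascii_chars if re.fullmatch(r"[A-Za-z]", c)]

# A backwards range is rejected when the pattern is compiled.
try:
    re.compile(r"[z-a]")
    backwards_accepted = True
except re.error:
    backwards_accepted = False

print(extras)
print(len(letters_only), backwards_accepted)
```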
Regular expression   Matches
[0-9]                Any digit
[^0-9]               Any character other than a digit
[-0-9]               Any digit or a -
[0-9-]               Any digit or a -
[^-0-9]              Any character except a digit or a -
[]0-9]               Any digit or a ]
[0-9]]               Any digit followed by a ]
[0-9-z]              Any digit or any character between 9 and z
[]0-9-]              Any digit, a -, or a ]
Many languages have adopted the Perl regular expression syntax for ranges; for example, \w is equivalent to any word character or [A-Za-z0-9_], while \W matches anything but a word character. See the perlre(1) manual page for more details. BB
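Several of these character sets can be checked with a short Python sketch. Two assumptions are worth flagging: Python's re module is standing in for grep, and the re.ASCII flag is passed so that \w means exactly [A-Za-z0-9_] (without it, Python also counts non-ASCII letters and digits as word characters). The helper name accepts is made up for the example.

```python
import re


def accepts(pattern, text):
    """True if the whole of text is matched by pattern (ASCII rules)."""
    return bool(re.fullmatch(pattern, text, flags=re.ASCII))


checks = {
    "digit":                 accepts(r"[0-9]", "7"),
    "non-digit vs 7":        accepts(r"[^0-9]", "7"),
    "non-digit vs x":        accepts(r"[^0-9]", "x"),
    "leading hyphen":        accepts(r"[-0-9]", "-"),
    "trailing hyphen":       accepts(r"[0-9-]", "-"),
    "negated hyphen":        accepts(r"[^-0-9]", "-"),
    "bracket in set":        accepts(r"[]0-9]", "]"),
    "digit then bracket":    accepts(r"[0-9]]", "5]"),
    "bracket alone":         accepts(r"[0-9]]", "]"),
    "underscore is a word":  accepts(r"\w", "_"),
    "hyphen is not a word":  accepts(r"\W", "-"),
}

for name, result in checks.items():
    print(f"{name:22} {result}")
```

Placing ] first in the set, or - first or last, is what makes each one an ordinary member of the set rather than a delimiter or range operator.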
At first glance, it might seem that starting the count at zero is stupid. Not so. Looking for an unknown number of characters is very important. Suppose you wanted to look for a digit at the beginning of a line, and there may or may not be spaces before the digit. Just use ^ * to match zero or more spaces at the beginning of the line. If you need to match one or more, just repeat the character set. That is, [0-9]* matches zero or more digits and [0-9][0-9]* matches one or more digits. BB
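Here is a minimal Python sketch of both points, with Python's re module standing in for grep and a few invented sample lines.

```python
import re

# Hypothetical input lines.
lines = ["7 dwarfs", "   3 bears", "no digits first", " x9"]

# Zero or more spaces at the beginning of the line, then a digit.
leading_digit = [line for line in lines if re.search(r"^ *[0-9]", line)]

# [0-9]* is satisfied by zero digits, so it "matches" even here (an empty match).
zero_or_more = re.match(r"[0-9]*", "abc")

# Repeating the character set demands at least one digit.
one_or_more_fail = re.match(r"[0-9][0-9]*", "abc")
one_or_more_ok = re.match(r"[0-9][0-9]*", "42abc")

print(leading_digit)
print(repr(zero_or_more.group(0)), one_or_more_fail, one_or_more_ok.group(0))
```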
Any numbers between 0 and 255 can be used. The second number may be omitted, which removes the upper limit. If the comma and the second number are omitted, the pattern must be duplicated the exact number of times specified by the first number.
The backslashes deserve a special discussion. Normally a backslash turns off the special meaning for a character. For example, a literal period is matched by \. and a literal asterisk is matched by \*. However, if a backslash is placed before a <, >, {, }, (, or ) or before a digit, the backslash turns on a special meaning. This was done because these special functions were added late in the life of regular expressions. Changing the meaning of {, }, (, ), <, and > would have broken old expressions. (This is a horrible crime punishable by a year of hard labor writing COBOL programs.) Instead, adding a backslash added functionality without breaking old programs. Rather than complain about the change, view it as evolution.
You must remember that modifiers like * and \{1,5\} act as modifiers only if they follow a character set. If they were at the beginning of a pattern, they would not be modifiers. Table 32-3 is a list of examples and the exceptions.
Table 32-3. Regular expression pattern repetition examples

Regular expression   Matches
*                    Any line with a *
\*                   Any line with a *
\\                   Any line with a \
^*                   Any line starting with a *
^A*                  Any line
^A\*                 Any line starting with an A*
^AA*                 Any line starting with one A
^AA*B                Any line starting with one or more A's followed by a B
^A\{4,8\}B           Any line starting with four, five, six, seven, or eight A's followed by a B
^A\{4,\}B            Any line starting with four or more A's followed by a B
^A\{4\}B             Any line starting with an AAAAB
\{4,8\}              Any line with a {4,8}
A{4,8}               Any line with an A{4,8}
BB
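The counted repetitions in the table can be tried with the Python sketch below. Be aware of the dialect difference, which is an assumption of this example rather than something the table shows: Python's re module uses the extended-style spelling, where plain {4,8} is the modifier and \{ is a literal brace, the reverse of the simple regular expressions in the table.

```python
import re

# Lines made of n A's followed by a B, keyed by n.
samples = {n: "A" * n + "B" for n in (3, 4, 8, 9)}

# Python spelling of ^A\{4,8\}B, ^A\{4,\}B and ^A\{4\}B.
four_to_eight = [n for n, s in samples.items() if re.search(r"^A{4,8}B", s)]
four_or_more = [n for n, s in samples.items() if re.search(r"^A{4,}B", s)]
exactly_four = [n for n, s in samples.items() if re.search(r"^A{4}B", s)]

# With backslashes, Python matches the braces literally.
literal_braces = bool(re.search(r"A\{4,8\}", "an A{4,8} in the text"))

print(four_to_eight, four_or_more, exactly_four, literal_braces)
```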
* In Perl, you can also use $1 through $9 and even beyond, with the right switches, in addition to the backslash mechanism.
answers this problem by defining a variety of greedy and non-greedy regular expressions, which allow you to specify which behavior you want. See the perlre(1) manual page for details. BB
All lines starting with From: or Subject: will be printed. There is no easy way to do this with simple regular expressions. You could try something like ^[FS][ru][ob][mj]e*c*t*: and hope you don't have any lines that start with Sromeet:. Extended expressions don't have the \< and \> characters. You can compensate by using the alternation mechanism. Matching the word "the" in the beginning, middle, or end of a sentence or at the end of a line can be done with the extended regular expression (^| )the([^a-z]|$). There are two choices before the word: a space or the beginning of a line. Following the word, there must be something besides a lowercase letter or else the end of the line. One extra bonus with extended regular expressions is the ability to use the *, +, and ? modifiers after a (...) grouping.

[If you're on a Darwin system and use Apple Mail or one of the many other clients, you can grep through your mail files locally. For Mail, look in your home directory's Library/Mail/ directory. There should be a subdirectory there, perhaps named something like iTools:example@mail.example.com, with an IMAP
directory tree beneath it. IMAP stores messages individually, not in standard Unix mbox format, so there is no way to look for all matches in a single mailbox by grepping a single file, but fortunately, you can use regular expressions to construct a file list to search. :-) SJC]

Here are two ways to match "a simple problem", "an easy problem", as well as "a problem"; the second expression is more exact:
% egrep "a[n]? (simple|easy)? ?problem" data
% egrep "a[n]? ((simple|easy) )?problem" data
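To see why the second expression is more exact, and to try the alternation patterns from this article, here is a Python sketch. Python's re syntax is close to the extended flavor for these constructs, which is why the patterns can be used unchanged; the test phrases and header lines are invented for the example.

```python
import re

loose = r"a[n]? (simple|easy)? ?problem"
exact = r"a[n]? ((simple|easy) )?problem"

phrases = ["a problem", "an easy problem", "a simple problem", "a simpleproblem"]
loose_hits = [p for p in phrases if re.search(loose, p)]
exact_hits = [p for p in phrases if re.search(exact, p)]

# Alternation inside a group: lines starting with From: or Subject:
headers = ["From: a@example.com", "Subject: hi", "To: b@example.com"]
header_hits = [h for h in headers if re.search(r"^(From|Subject):", h)]

# Emulating word boundaries with alternation.
word_samples = ["the end", "see the", "then", "bathe"]
the_hits = [s for s in word_samples if re.search(r"(^| )the([^a-z]|$)", s)]

print(loose_hits)
print(exact_hits)
print(header_hits)
print(the_hits)
```

The loose pattern makes the adjective and the space after it optional independently, so it also accepts the run-together "a simpleproblem"; the exact pattern makes them optional as a unit.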
I promised to explain why the backslash characters don't work in extended regular expressions. Well, perhaps the \{...\} and \<...\> could be added to the extended expressions, but it might confuse people if those characters are added and the \(...\) are not. And there is no way to add that functionality to the extended expressions without changing the current usage. Do you see why? It's quite simple. If ( has a special meaning, then \( must be the ordinary character. This is the opposite of the simple regular expressions, where ( is ordinary and \( is special. The usage of the parentheses is incompatible, and any change could break old programs.

If the extended expression used (...|...) as regular characters, and \(...\|...\) for specifying alternate patterns, then it is possible to have one set of regular expressions that has full functionality. This is exactly what GNU Emacs (19.1) does, by the way; it combines all of the features of regular and extended expressions with one syntax. BB
This process is virtually the same kind of process that a programmer follows to develop a program. Step 1 might be considered the specification, which should reflect an understanding of the problem to be solved as well as how to solve it. Step 2 is analogous to the actual coding of the program, and step 3 involves running the program and testing it against the specification. Steps 2 and 3 form a loop that is repeated until the program works satisfactorily.

Testing your description of what you want to match ensures that the description works as expected. It usually uncovers a few surprises. Carefully examining the results of a test, comparing the output against the input, will greatly improve your understanding of regular expressions. You might consider evaluating the results of a pattern-matching operation as follows:

Hits
    The lines that I wanted to match.
Misses
    The lines that I didn't want to match.
Misses that should be hits
    The lines that I didn't match but wanted to match.
Hits that should be misses
    The lines that I matched but didn't want to match.

Trying to perfect your description of a pattern is something that you work at from opposite ends: you try to eliminate the "hits that should be misses" by limiting the possible matches, and you try to capture the "misses that should be hits" by expanding the possible matches.

The difficulty is especially apparent when you must describe patterns using fixed strings. Each character you remove from the fixed-string pattern increases the number of possible matches. For instance, while searching for the string what, you determine that you'd like to match What as well. The only fixed-string pattern that will match What and what is hat, the longest string common to both. It is obvious, though, that searching for hat will produce unwanted matches. Each character you add to a fixed-string pattern decreases the number of possible matches. The string them is going to produce fewer matches than the string the.
Using metacharacters in patterns provides greater flexibility in extending or narrowing the range of matches. Metacharacters, used in combination with literals or other metacharacters, can be used to expand the range of matches while still eliminating the matches that you do not want. DD
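The four categories can be computed mechanically once you have sorted a few test lines into "wanted" and "unwanted". The Python sketch below does this for the what/What example; the function name evaluate and the sample lines are made up for the illustration, and Python's re module stands in for grep.

```python
import re


def evaluate(pattern, wanted, unwanted):
    """Sort test lines into the four categories for one pattern."""
    regex = re.compile(pattern)
    return {
        "hits": [l for l in wanted if regex.search(l)],
        "misses": [l for l in unwanted if not regex.search(l)],
        "misses that should be hits": [l for l in wanted if not regex.search(l)],
        "hits that should be misses": [l for l in unwanted if regex.search(l)],
    }


# Hypothetical test data.
wanted = ["what time is it", "What now"]
unwanted = ["idle chatter", "nothing here"]

fixed_what = evaluate("what", wanted, unwanted)    # too narrow
fixed_hat = evaluate("hat", wanted, unwanted)      # too broad
char_set = evaluate("[Ww]hat", wanted, unwanted)   # a metacharacter fixes both

for name, report in [("what", fixed_what), ("hat", fixed_hat), ("[Ww]hat", char_set)]:
    print(name, report)
```

The fixed string what misses "What now", the fixed string hat picks up "idle chatter", and the character set [Ww]hat leaves both problem categories empty.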
showmatch
For example:
% showmatch 'CD-...' mbox
and CD-ROM publishing. We have recognized
    ^^^^^^
that documentation will be shipped on CD-ROM; however,
                                      ^^^^^^
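If you don't have the script handy, the same kind of display is easy to approximate. The following Python sketch is not the showmatch script itself; it is a small stand-in, with a made-up function name, that prints each matching line followed by a row of carets under the matched text.

```python
import re


def show_matches(pattern, lines):
    """Return each matching line followed by a line of carets under the matches."""
    regex = re.compile(pattern)
    output = []
    for line in lines:
        marks = [" "] * len(line)
        for match in regex.finditer(line):
            for i in range(match.start(), match.end()):
                marks[i] = "^"
        if "^" in marks:  # skip lines with no (non-empty) match
            output.append(line)
            output.append("".join(marks).rstrip())
    return output


sample = [
    "and CD-ROM publishing. We have recognized",
    "a line with no match at all",
    "that documentation will be shipped on CD-ROM; however,",
]
for row in show_matches("CD-...", sample):
    print(row)
```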
xgrep
xgrep is a related script that simply retrieves only the matched text. This allows you to extract patterned data from a file. For example, you could extract only the numbers from a table containing both text and numbers. It's also great for counting the number of occurrences of some pattern in your file, as shown below. Just be sure that your expression matches only what you want. If you aren't sure, leave off the wc command and glance at the output. For example, the regular expression [0-9]* will match numbers like 3.2 twice: once for the 3 and again for the 2! You want to include a dot (.) and/or comma (,), depending on how your numbers are written. For example: [0-9][.0-9]* matches a leading digit, possibly followed by more dots and digits.
Remember that an expression like [0-9]* will match zero numbers (because * means zero or more of the preceding character). That expression can make xgrep run for a very long time! The following expression, which matches one or more digits, is probably what you want instead:
xgrep "[0-9][0-9]*" files | wc -l
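The counting pitfalls described above can be reproduced with Python's re.findall, which returns only the matched text, much as xgrep does. This is a stand-in for experimenting with the expressions, not the xgrep script, and the sample sentence is invented.

```python
import re

# Hypothetical text containing a decimal number and two integers.
text = "Widgets cost 3.2 dollars; 17 were sold, 4 returned."

# One or more digits: 3.2 is counted as two numbers.
separate = re.findall(r"[0-9][0-9]*", text)

# A leading digit followed by more dots and digits: 3.2 stays whole.
whole = re.findall(r"[0-9][.0-9]*", text)

# Zero or more digits also produces empty matches, which is rarely what you want.
with_empty = re.findall(r"[0-9]*", "3.2")

print(len(separate), separate)
print(len(whole), whole)
print("" in with_empty)
```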
The xgrep shell script runs the sed commands below, replacing $re with the regular expression from the command line and $x with a CTRL-b character (which is used as a delimiter). We've shown the sed commands numbered, like 5>; these are only for reference and aren't part of the script:
1> \$x$re$x!d
2> s//$x&$x/g
3> s/[^$x]*$x//
4> s/$x[^$x]*$x/\
   /g
5> s/$x.*//
Command 1 deletes all input lines that don't contain a match. On the remaining lines (which do match), command 2 surrounds the matching text with CTRL-b delimiter characters. Command 3 removes all characters (including the first delimiter) before the first match on a line. When there's more than one match on a line, command 4 breaks the multiple matches onto separate lines. Command 5 removes the last delimiter, and any text after it, from every output line. Greg Ubben revised showmatch and wrote xgrep. JP, DD, and TOR
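To follow the five steps on a concrete line, here is a rough Python paraphrase of them. It is a sketch for study, not a replacement for the script: the function name xgrep_line is made up, the regular expression is interpreted by Python's re module rather than by sed, and patterns that can match the empty string are not handled.

```python
import re

DELIM = "\x02"  # CTRL-b, used as a delimiter that should not occur in the text


def xgrep_line(regex, line):
    """Return the pieces of line matched by regex, one per list element."""
    # Command 1: drop lines that don't contain a match.
    if not re.search(regex, line):
        return []
    # Command 2: surround every match with delimiters.
    s = re.sub(regex, lambda m: DELIM + m.group(0) + DELIM, line)
    # Command 3: remove everything up to and including the first delimiter.
    s = re.sub(f"[^{DELIM}]*{DELIM}", "", s, count=1)
    # Command 4: the text between two matches becomes a line break.
    s = re.sub(f"{DELIM}[^{DELIM}]*{DELIM}", "\n", s)
    # Command 5: remove the last delimiter and anything after it.
    return [re.sub(f"{DELIM}.*", "", part) for part in s.split("\n")]


print(xgrep_line("CD-...", "CD-ROM and CD-RAM here"))
print(xgrep_line("CD-...", "nothing to see"))
```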
Let's imagine an HTML table with lots of entries, each of which has two quoted strings, as shown below:
<td><a href="#arts"><img src="d_arrow.gif" border=0></a>
All the text in each line of the table is the same, except the text inside the quotes. To match the line through the first quoted string, a novice might describe the pattern with the following regular expression:
<td><a href=".*"
However, the pattern ends up matching almost all of the entry because the second quotation mark in the pattern matches the last quotation mark on the line! If you know how many quoted strings there are, you can specify each of them:
<td><a href=".*"><img src=".*" border=0></a>
Although this works as you'd expect, some line in the file might not have the same number of quoted strings, causing misses that should be hits. You simply want the first argument. Here's a different regular expression that matches the shortest possible extent between two quotation marks:
"[^"]*"
It matches a quote, followed by any number of characters that do not match a quote, followed by a quote. Note, however, that it will be fooled by escaped quotes, in strings such as the following:
$strExample = "This sentence contains an escaped \" character.";
The use of what we might call negated character classes like this is one of the things that distinguishes the journeyman regular expression user from the novice.
DD and JP
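The greedy and negated-class patterns can be compared side by side on the sample table line. This is a minimal sketch: the function names greedy and shortest are made up here, and sed is used only to print what each pattern captures.

```shell
line='<td><a href="#arts"><img src="d_arrow.gif" border=0></a>'

# Greedy: ".*" runs from the first quote to the last quote on the line.
greedy()   { printf '%s\n' "$1" | sed 's/^[^"]*\(".*"\).*$/\1/'; }

# Negated character class: "[^"]*" stops at the first closing quote.
shortest() { printf '%s\n' "$1" | sed 's/^[^"]*\("[^"]*"\).*$/\1/'; }

greedy "$line"      # prints "#arts"><img src="d_arrow.gif"
shortest "$line"    # prints "#arts"
```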
Table 32-4. Some useful regular expressions

Regular expression
[A-Z][A-Z]
^.*, [A-Z][A-Z]
[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}
Table 32-4. Some useful regular expressions (continued)

Item                              Example                   Regular expression
U.S. telephone number             (547-5800)                [0-9]\{3\}-[0-9]\{4\}
Unformatted dollar amounts        ($1); ($ 1000000.00)      \$ *[0-9]+(\.[0-9][0-9])?
HTML/SGML/XML tags                (<h2>); (<UL COMPACT>)
troff macro with first argument   (.SH SEE ALSO)
troff macro with all arguments    (.Ah Tips for ex & vi)
Blank lines
Entire line
One or more spaces
DD and JP
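The telephone and dollar-amount expressions can be tried against a few sample lines. The sample text and the function name sample are invented for this sketch. grep -E is used for the extended syntax; egrep should behave the same where it is available.

```shell
# Hypothetical sample data for trying the two expressions.
sample() {
    cat <<'EOF'
Call 547-5800 today
Total: $ 1000000.00
Fee: $1
nothing to match here
EOF
}

# Telephone number: basic regular expression, so plain grep will do.
sample | grep '[0-9]\{3\}-[0-9]\{4\}'

# Dollar amounts: + ( ) ? are extended syntax, so use grep -E.
# Single quotes keep the shell from interpreting the $.
sample | grep -E '\$ *[0-9]+(\.[0-9][0-9])?'
```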
Table 32-5 (program columns: ed, ex, vi, sed, awk, grep, egrep)

Symbol     Action
.          Match any character.
*          Match zero or more preceding.
^          Match beginning of line.
$          Match end of line.
\          Escape character following.
[ ]        Match one from a set.
\( \)      Store pattern for later replay.
\{ \}      Match a range of instances.
Table 32-5 (continued; program columns: ed, ex, vi, sed, awk, grep, egrep)

Symbol     Action
\< \>      Match word's beginning or end.
+          Match one or more preceding.
?          Match zero or one preceding.
|          Separate choices to match.
( )        Group expressions to match.
In ed, ex, and sed, note that you specify both a search pattern (on the left) and a replacement pattern (on the right). The metacharacters in Table 32-5 are meaningful only in a search pattern. ed, ex, and sed support the additional metacharacters in Table 32-6 that are valid only in a replacement pattern.
Table 32-6. Valid metacharacters for replacement patterns (program columns: ex, sed, ed)

Symbol     Action
\          Escape character following.
\n         Reuse pattern stored by \( \) pair number \n.
&          Reuse previous search pattern.
~          Reuse previous replacement pattern.
\u \U      Change character(s) to uppercase.
\l \L      Change character(s) to lowercase.
\E         Turn off previous \U or \L.
\e         Turn off previous \u or \l.
DG
Pattern    What does it match?
.          Match any single character except newline.
*          Match any number (including none) of the single characters that immediately precede it. The preceding character can also be a regular expression. For example, since . (dot) means any character, .* means match any number of any character.
Pattern    What does it match?
^          Match the following regular expression at the beginning of the line.
$          Match the preceding regular expression at the end of the line.
[ ]        Match any one of the enclosed characters. A hyphen (-) indicates a range of consecutive characters. A caret (^) as the first character in the brackets reverses the sense: it matches any one character not in the list. A hyphen or a right square bracket (]) as the first character is treated as a member of the list. All other metacharacters are treated as members of the list.
\{n,m\}    Match a range of occurrences of the single character that immediately precedes it. The preceding character can also be a regular expression. \{n\} will match exactly n occurrences, \{n,\} will match at least n occurrences, and \{n,m\} will match any number of occurrences between n and m.
\          Turn off the special meaning of the character that follows (except for \{ and \(, etc., where it turns on the special meaning of the character that follows).
\( \)      Save the pattern enclosed between \( and \) into a special holding space. Up to nine patterns can be saved on a single line. They can be replayed in substitutions by the escape sequences \1 to \9.
\< \>      Match characters at beginning (\<) or end (\>) of a word.
+          Match one or more instances of preceding regular expression.
?          Match zero or one instances of preceding regular expression.
|          Match the regular expression specified before or after.
( )        Apply a match to the enclosed group of regular expressions.
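A few of these search metacharacters can be exercised with grep on a short word list. The word list and the function name words are invented for this sketch. grep -E stands in for egrep when trying the extended operators.

```shell
# Hypothetical word list for trying the patterns.
words() { printf '%s\n' bag book keeper aaa banana; }

# \{n,m\}: lines made of three or four lowercase letters.
words | grep '^[a-z]\{3,4\}$'

# \( \) and \1: lines where some character is immediately repeated.
words | grep '\(.\)\1'

# | ( ) ?: extended syntax, so use grep -E.
words | grep -E '^(bag|book)s?$'
```

The first command should pass bag, book, and aaa; the second should pass book, keeper, and aaa.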
The characters in Table 32-8 have special meaning only in replacement patterns.
Table 32-8. Special characters in replacement patterns

Pattern    What does it do?
\          Turn off the special meaning of the character that follows.
\n         Restore the nth pattern previously saved by \( and \). n is a number from 1 to 9, with 1 starting on the left.
&          Reuse the string that matched the search pattern as part of the replacement pattern.
\u         Convert first character of replacement pattern to uppercase.
\U         Convert replacement pattern to uppercase.
\l         Convert first character of replacement pattern to lowercase.
\L         Convert replacement pattern to lowercase.
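As a small illustration of & and \n in a replacement, the sed commands below use a made-up input string. The case-conversion escapes are left out of the sketch because support for them differs between programs.

```shell
# &: reuse the matched text in the replacement.
echo 'hello world' | sed 's/world/[&]/'

# \1 and \2: replay the two saved patterns in the opposite order.
echo 'hello world' | sed 's/\([a-z]*\) \([a-z]*\)/\2 \1/'

# \&: the backslash turns off the special meaning of &.
echo 'hello world' | sed 's/world/\&/'
```

The three commands print hello [world], world hello, and hello & respectively.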
Note that many programs, especially perl, awk, and sed, implement their own programming languages and often have much more extensive support for regular expressions. As such, their manual pages are the best place to look when you wish to confirm which expressions are supported or whether the program supports more than simple regular expressions. On many systems, notably those
with a large complement of GNU tools, the regular expression support is astonishing, and many generations of tools may be implemented by one program (as with grep, which also emulates the later egrep in the same program, with widely varying support for expression formats based on how the program is invoked). Don't make the mistake of thinking that all of these patterns will work everywhere in every program with regex support, or of thinking that this is all there is.
Examples of Searching
When used with grep or egrep, regular expressions are surrounded by quotes. (If the pattern contains a $, you must use single quotes from the shell; e.g., 'pattern'.) When used with ed, ex, sed, and awk, regular expressions are usually surrounded by / (although any delimiter works). Table 32-9 has some example patterns.
Table 32-9. Search pattern examples

Pattern           What does it match?
bag               The string bag.
^bag              bag at the beginning of a line.
bag$              bag at the end of a line.
^bag$             bag as the only text on a line.
[Bb]ag            Bag or bag.
b[aeiou]g         Second letter is a vowel.
b[^aeiou]g        Second letter is a consonant (or uppercase or symbol).
b.g               Second letter is any character.
^...$             Any line containing exactly three characters.
^\.               Any line that begins with a . (dot).
^\.[a-z][a-z]     Same, followed by two lowercase letters (e.g., troff requests).
^\.[a-z]\{2\}     Same as previous, grep or sed only.
^[^.]             Any line that doesn't begin with a . (dot).
bugs*             bug, bugs, bugss, etc.
"word"            A word in quotes.
"*word"*          A word, with or without quotes.
[A-Z][A-Z]*       One or more uppercase letters.
[A-Z]+            Same, extended regular expression format.
[A-Z].*           An uppercase letter, followed by zero or more characters.
[A-Z]*            Zero or more uppercase letters.
[a-zA-Z]          Any letter.
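Several of these patterns can be checked by counting matching lines with grep -c. The word list below is invented for the sketch, and each line holds a single word so that the counts are easy to verify by eye.

```shell
# Hypothetical word list, one word per line.
words() { printf '%s\n' bag big bug bgg b2g bugs bugss Bag; }

words | grep -c 'b[aeiou]g'     # second letter is a lowercase vowel
words | grep -c 'b[^aeiou]g'    # second letter is anything else
words | grep -c '^...$'         # exactly three characters
words | grep -c 'bugs*'         # bug, bugs, bugss
words | grep -c '[Bb]ag'        # Bag or bag
```

Note that Bag is not counted by the first two patterns, because both require a lowercase b.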
Table 32-9. Search pattern examples (continued)

What does it match?
Any symbol (not a letter or a number).
One of the numbers 5, 6, or 7.
One of the words five, six, or seven.
One of the numbers 8086, 80286, or 80386.
One of the words company or companies.
Words like theater or the.
Words like breathe or the.
The word the.
Five or more zeros in a row.
U.S. Social Security number (nnn-nn-nnnn).
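Patterns that fit some of these descriptions can be sketched and tested with grep. The patterns below are illustrative choices written for this sketch and are not necessarily the ones the table originally listed. The sample inputs are also invented. The -x option makes grep match whole lines only.

```shell
# One of the numbers 8086, 80286, or 80386 (extended syntax for ?).
printf '%s\n' 8086 80286 80386 80486 8186 | grep -E -x '80[23]?86'

# One of the words company or companies.
printf '%s\n' company companies compan | grep -E -x 'compan(y|ies)'

# One of the words five, six, or seven.
printf '%s\n' five six eight | grep -E -x 'five|six|seven'

# Five or more zeros in a row (basic syntax).
printf '%s\n' 10000 100000 1000000 | grep '0\{5,\}'

# U.S. Social Security number (nnn-nn-nnnn).
printf '%s\n' 123-45-6789 123-456-789 | grep '[0-9]\{3\}-[0-9]\{2\}-[0-9]\{4\}'
```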
Command                    Result
                           Redo the entire line, but add parentheses.
                           Change a word list into mv commands.
                           Delete blank lines.
                           ex version of previous.
                           Delete blank lines, plus lines containing only spaces or TABs.
:g/^[ tab]*$/d             ex version of previous.
s/  */ /g                  Turn one or more spaces into one space.
:%s/  */ /g                ex version of previous.
:s/[0-9]/Item &:/          Turn a number into an item label (on the current line).
:s                         Repeat the substitution on the first occurrence.
:&                         Same.
:sg                        Same, but for all occurrences on the line.
:&g                        Same.
:%&g                       Repeat the substitution globally.
:.,$s/Fortran/\U&/g        Change word to uppercase, on current line to last line.
:%s/.*/\L&/                Lowercase entire file.
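Some of these results can be reproduced with sed from the shell. The commands below are a sketch written for illustration: the function names and the .old suffix are assumptions, and a literal TAB is produced with printf so that it stays visible in the script.

```shell
tab=$(printf '\t')    # a literal TAB character

# Delete blank lines, plus lines containing only spaces or TABs.
strip_blank() { sed "/^[ ${tab}]*\$/d"; }

# Turn one or more spaces into one space (two spaces before the *).
squeeze() { sed 's/  */ /g'; }

# Change a word list into mv commands (the .old suffix is hypothetical).
to_mv() { sed 's/.*/mv & &.old/'; }

printf 'a\n\n \t\nb\n' | strip_blank
echo 'a    b  c' | squeeze
echo 'file1' | to_mv
```

With a single space before the *, the squeeze pattern would also match the empty string and insert a space between every pair of characters, which is why the second space matters.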
Result
Uppercase first letter of each word on current line (useful for titles).
Globally change a word to No.
Globally change a different word to No (previous replacement).
Transpose words.
Transpose, using hold buffers to preserve case.
DG