
CHAPTER 15

Filters Using Regular Expressions: grep and sed

You often need to search a file for a pattern, either to see the lines containing
(or not containing) it or to have it replaced by something else. This chapter
discusses two important filters that are specially suited for these tasks: grep and sed.
grep takes care of all search requirements you may have, and does the job well. sed
goes further and can even manipulate the individual characters in a line. In fact, sed can
do several things, some of them quite well.
This chapter also takes up regular expressions, one of the fascinating features
of UNIX. You've already had a taste of these expressions when using the search
capabilities of vi and emacs. But it's in this chapter that you'll see them in all their
manifestations. The rules for framing the patterns are well defined, and in no time you'll be
able to devise compact expressions that perform amazing matches. In fact, it's common
to find just a single line of grep or sed code replacing several lines of C code.
The system administrator must be adept in understanding and framing regular
expressions. Learning to use them along with grep and sed serves as a suitable prelude
to learning perl, which uses many of their features. These features are discussed here
but assumed in the chapter on perl.

Objectives
• Output lines containing a simple string with grep. (15.2)
• Use grep's options to display their count, line numbers, and lines not containing a
pattern. (15.3)
• Use a regular expression to search for multiple similar strings. (15.4)
• Use egrep and fgrep with multiple patterns. (15.5)
• Learn how egrep uses a special regular expression that can group and delimit
multiple patterns. (15.6)
• Use sed to select and edit lines. (15.7 to 15.10)
• Replace one pattern with another in sed, using regular expressions where necessary.
(15.11)
• Use the interval and tagged regular expressions to enhance the power of grep and
sed. (15.12)

Your UNIX: The Ultimate Guide
15.1 The Sample Database

In this chapter and the ones dealing with filters and shell programming, you'll often be
referring to the file emp.lst. Sometimes, you'll also be using another file or two
derived from it. It's a good idea to have a close look at the file now and understand its
organization:

$ cat emp.lst
2233|charles harris  |g.m.     |sales     |12/12/52| 90000
9876|bill johnson    |director |production|03/12/50|130000
5678|robert dylan    |d.g.m.   |marketing |04/19/43| 85000
2365|john woodcock   |director |personnel |05/11/47|120000
5423|barry wood      |chairman |admin     |08/30/56|160000
1006|gordon lightfoot|director |sales     |09/03/38|140000
6213|michael lennon  |g.m.     |accounts  |06/05/62|105000
1265|p.j. woodhouse  |manager  |sales     |09/12/63| 90000
4290|neil o'bryan    |executive|production|09/07/50| 65000
2476|jackie wodehouse|manager  |sales     |05/01/59|110000
6521|derryk o'brien  |director |marketing |09/26/45|125000
3212|bill wilcocks   |d.g.m.   |accounts  |12/12/55| 85000
3564|ronie trueman   |executive|personnel |07/06/47| 75000
2345|james wilcox    |g.m.     |marketing |03/12/45|110000
0110|julie truman    |g.m.     |marketing |12/31/40| 95000

The first five lines of this file were used as the file shortlist in the section on sort
(9.12). The significance of the fields has also been explained there, but we'll recount
it just the same. This is a text file containing 15 lines of a personnel database. There
are six fields (empid, name, designation, department, date of birth and salary), with the
| as the field delimiter. This character has a special meaning to the shell, so we must
remember to escape it whenever we specify the delimiter.
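
As a minimal sketch of what escaping the delimiter looks like, the lines below extract two fields from a three-record excerpt. The cut command and the variable names (sample, names, desig) are choices made for this illustration only; any command that takes a delimiter argument would need the same protection.

```shell
# Three sample records in the emp.lst layout (fields separated by |).
sample='2233|charles harris  |g.m.     |sales     |12/12/52| 90000
9876|bill johnson    |director |production|03/12/50|130000
5678|robert dylan    |d.g.m.   |marketing |04/19/43| 85000'

# Quoting the | stops the shell from treating it as the pipeline operator.
names=$(printf '%s\n' "$sample" | cut -d'|' -f2)

# A backslash before the | has the same effect as the quotes.
desig=$(printf '%s\n' "$sample" | cut -d\| -f3)

printf '%s\n' "$names" "$desig"
```

Without the quotes or the backslash, the shell would read the | as the start of a new command in a pipeline, and cut would never see it.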

15.2 grep: Searching for a Pattern


UNIX has a special family of commands for handling search requirements, and the
principal member of this family is the grep command. It scans a file for the occurrence
of a pattern and, depending on the options used, displays
• Lines containing the selected pattern.
• Lines not containing the selected pattern. (-v)
• Line numbers where the pattern occurs. (-n)
• Number of lines containing the pattern. (-c)
• Filenames where the pattern occurs. (-l)
grep is exceedingly simple to use too. Its syntax treats the first argument as the
pattern and the rest as filenames:

grep options pattern filename(s)


We'll first use simple strings as search patterns and later use regular expressions. This
is how grep displays lines containing the pattern sales from the file emp.lst:

$ grep sales emp.lst
2233|charles harris  |g.m.     |sales     |12/12/52| 90000
1006|gordon lightfoot|director |sales     |09/03/38|140000
1265|p.j. woodhouse  |manager  |sales     |09/12/63| 90000
2476|jackie wodehouse|manager  |sales     |05/01/59|110000

We didn't quote the pattern here; quoting is essential if the search string comprises
multiple words or uses any of the shell's characters like *, ? and so forth. However, quoting
doesn't cause any harm either; grep "sales" emp.lst works in just the same way.
Because grep is also a filter, it can search its standard input for the pattern and
store the output in a file:

who | grep henry > foo

When grep is used with a series of strings, it interprets the first argument as the
pattern and the rest as filenames. Every line displayed is prefixed by its filename:

$ grep director emp1.lst emp2.lst
emp1.lst:1006|gordon lightfoot|director |sales     |09/03/38|140000
emp1.lst:6521|derryk o'brien  |director |marketing |09/26/45|125000
emp2.lst:9876|bill johnson    |director |production|03/12/50|130000
emp2.lst:2365|john woodcock   |director |personnel |05/11/47|120000

15.2.1 Quoting in grep


When you use multiword strings as the pattern, you must quote the pattern. If you
don't, grep treats its first argument as the pattern and the rest as filenames. Try using
grep with the name gordon lightfoot:

$ grep gordon lightfoot emp.lst
grep: lightfoot: No such file or directory
emp.lst:1006|gordon lightfoot|director |sales     |09/03/38|140000

grep interprets lightfoot as a filename, and obviously fails to open such a file.
However, its search continues by using the next argument, i.e., emp.lst. Now, quote the
pattern:

$ grep 'gordon lightfoot' emp.lst
1006|gordon lightfoot|director |sales     |09/03/38|140000

Now, let's try to locate neil o'bryan from the file:

$ grep 'neil o'bryan' emp.lst
>

What happened here? The Bourne, Korn and bash shells interpret this as an incomplete
command by issuing a secondary prompt string (>). The same command run in the C
shell even causes an error:

% grep 'neil o'bryan' emp.lst
Unmatched '.

The pattern itself contains a single quote. The shell looks for an even number of
quotes to determine the boundaries of noninterference. We should have remembered
that single quotes don't protect single quotes, only double quotes do:

$ grep "neil o'bryan" emp.lst
4290|neil o'bryan    |executive|production|09/07/50| 65000

Though quotes are redundant in single-word fixed strings, it's better to enforce their
use. It sets up a good habit with no adverse consequences. You can then use regular
expressions inside them.

Note: You need to quote the pattern in grep if the pattern contains more than one word or
special characters that can be interpreted otherwise by the shell. You can generally use either
single or double quotes, but if you need command substitution or variable evaluation, you
must use double quotes.
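
The difference between the two kinds of quotes can be sketched as follows. The variable surname and the two-record excerpt are hypothetical, introduced only for this illustration.

```shell
# Two records in the emp.lst layout, kept in a variable for this sketch.
records="4290|neil o'bryan    |executive|production|09/07/50| 65000
6521|derryk o'brien  |director |marketing |09/26/45|125000"

surname="o'bryan"       # hypothetical variable holding part of the pattern

# Double quotes: the shell replaces $surname before grep runs.
expanded=$(printf '%s\n' "$records" | grep -c "neil $surname" || true)

# Single quotes: grep receives the text $surname unchanged,
# so nothing in the data matches.
literal=$(printf '%s\n' "$records" | grep -c 'neil $surname' || true)

echo "double quotes: $expanded, single quotes: $literal"
```

The || true is there only because grep returns a nonzero exit status when it finds nothing, which some scripts treat as fatal.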

15.2.2 When grep Fails


grep is a representative UNIX command that silently returns the prompt when the
pattern can't be located:

$ grep president emp.lst
$                                         No president found

There's more to it here than meets the eye. The command failed because the string
president couldn't be located. Though the feature of scanning a file for a pattern is
available in both sed and awk, these commands are not considered to fail if they can't
locate a pattern in their input.
Don't, however, draw the wrong conclusion from the above behavioral pattern of
grep. The silent return of the shell prompt is no evidence of failure. In fact, the silent
behavior of cmp denotes success. Success or failure is determined by the value of a
special variable ($?) that gets set when a command has finished execution. You'll see in
the chapter on shell programming (18.5.1) how this variable is applied in the command
line of the shell's programming constructs.
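
Here is a small sketch of how $? separates the two cases. The records and the variable names found and missing are assumptions made for the example; the output is sent to /dev/null so that only the exit status is of interest.

```shell
# Two records in the emp.lst layout, supplied on standard input.
records='9876|bill johnson    |director |production|03/12/50|130000
5423|barry wood      |chairman |admin     |08/30/56|160000'

# Capture the exit status only when grep reports failure (nonzero).
found=0
printf '%s\n' "$records" | grep director > /dev/null || found=$?

missing=0
printf '%s\n' "$records" | grep president > /dev/null || missing=$?

echo "director: $found, president: $missing"
```

A status of 0 means at least one line matched; 1 means grep ran properly but located nothing.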

15.3 grep Options

grep is one of the most frequently used UNIX commands. It uses a few options (Table
15.1), and is one command most of whose options you must know. Fortunately, there
aren't too many of them.
Solaris maintains a POSIX-compliant version of grep in /usr/xpg4/bin. This
version uses the -E and -F options that are also used by Linux. If you want to use this
executable, then either use the absolute pathname or change the PATH setting (17.3).

Counting Occurrences (-c)  How many directors are there in the file? The -c (count)
option counts the occurrences; the following example reveals that there are four of them:

TABLE 15.1 Options Used by the grep Family

Option    Significance
-c        Displays count of lines containing the pattern
-l        Displays list of filenames only
-n        Displays line numbers along with lines
-v        Doesn't display lines matching expression
-i        Ignores case when matching
-h        Omits filenames when handling multiple files
-w        Matches complete word (grep only)
-e pat    Also matches pattern pat beginning with a - (hyphen)
-e pat    As above, but can be used multiple times (Linux and some UNIX versions)
-E        Treats pattern as an egrep regular expression (Linux and Solaris-xpg4)
-F        Matches pattern in fgrep-style (Linux and Solaris-xpg4)
-n        Displays line and n lines above and below, where n is a number (Linux only)
-A n      Displays line and n lines after matching lines (Linux only)
-B n      Displays line and n lines before matching lines (Linux only)
-f file   Takes patterns from file, one per line (Linux only)

$ grep -c 'director' emp.lst


4

This is one of the few grep options that doesn't display the lines at all. If you use this
command with multiple files, the filename is prefixed to the line count:

$ grep -c director emp*.lst


emp.lst:4
emp1.lst:2
emp2.lst:2
empold.lst:4

Sometimes, you need to get a single count from all these files so that you can use it in
script logic. You have already handled a similar situation before (8.8.1), and you should
be able to use grep in a manner that drops the filenames from the output. Try the -h
option.
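
One way of arriving at a single figure is sketched below, under the assumption that mktemp is available for creating a scratch directory. The two small files and the variable names are made up for the illustration.

```shell
# Scratch directory with two small files (mktemp is assumed to be available).
dir=$(mktemp -d)
printf '%s\n' '1006|gordon lightfoot|director' \
              '1265|p.j. woodhouse  |manager' > "$dir/emp1.lst"
printf '%s\n' '9876|bill johnson    |director' \
              '2365|john woodcock   |director' > "$dir/emp2.lst"

# Concatenating first gives grep a single stream, so -c prints one number.
total=$(cat "$dir"/emp?.lst | grep -c director)

# -h drops the filename prefix from each line; wc -l then counts the lines.
total_h=$(grep -h director "$dir"/emp?.lst | wc -l)

echo "$total"
rm -r "$dir"
```

Both approaches report the same number. Note that combining -h with -c would still print one count per file, only without the filenames.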

Displaying Line Numbers (-n)  The -n (number) option can be used to display the
line numbers containing the pattern, along with the lines:

$ grep -n 'marketing' emp.lst
3:5678|robert dylan    |d.g.m.   |marketing |04/19/43| 85000
11:6521|derryk o'brien  |director |marketing |09/26/45|125000
14:2345|james wilcox    |g.m.     |marketing |03/12/45|110000
15:0110|julie truman    |g.m.     |marketing |12/31/40| 95000

The line numbers are shown at the beginning of each line, separated from the actual
line by a :. If you use this option with multiple filenames, then you would have two
additional fields, the filename and the line number:

$ grep -n 'marketing' emp?.lst | head -2
emp1.lst:2:5678|robert dylan    |d.g.m.   |marketing |04/19/43| 85000
emp1.lst:6:6521|derryk o'brien  |director |marketing |09/26/45|125000

Deleting Lines (-v)  The -v (inverse) option selects all but the lines containing the
pattern. Thus, you can create a file otherlist containing all but directors:

$ grep -v 'director' emp.lst > otherlist
$ wc -l otherlist
     11 otherlist                         There were 4 directors initially

This is a useful option for "deleting" lines, but you can use it effectively only by
applying redirection. Obviously, the lines haven't been deleted from the original file as such.
We had to create a separate file otherlist containing all but the directors' lines.

Note: The -v option removes lines from grep's output, but doesn't actually change the
argument file.

Displaying Filenames (-l)  The -l (list) option displays only the names of files
where a pattern has been found:

$ grep -l 'manager' *.lst
desig.lst
emp.lst
emp1.lst
empn.lst

So, if you have forgotten the filename where you last saw something, just use this option
to find out which one has it. This is the second option that doesn't display the lines.

Ignoring Case (-i)  When you look for a name, but are not sure of the case, grep
offers the -i (ignore) option. This makes the match case-insensitive:

$ grep -i 'WILCOX' emp.lst
2345|james wilcox    |g.m.     |marketing |03/12/45|110000

This locates the name wilcox. However, a simple string like this can't match the name
wilcocks that also exists in the file but is spelled with minor differences. grep
supports very sophisticated techniques of pattern matching, and this is the ideal forum for
regular expressions to make their entry.

Patterns Beginning with a - (-e)  What happens when you look for a pattern that
begins with a hyphen? Most systems will show you something similar to this:

$ grep "-mtime" /var/spool/cron/crontabs/*
grep: -mtime illegal option
grep [-E|-F] [-c|-l|-q] [-bhinsvx] [-e pattern_list] [-f pattern_file] [pattern_list] [file ...]

grep treats -mtime as an option of its own, and finds it "illegal." To locate such
patterns, you must use the -e option:

$ grep -e "-mtime" /var/spool/cron/crontabs/*
romeo:55 17 * * 4 find / -name core -mtime +30 -print

Solaris offers this option only in its POSIX-compliant version in /usr/xpg4/bin. On
some systems (especially Linux), the -e option can also be used multiple times to let
you match multiple patterns.
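
The same idea can be tried without a crontab directory. This sketch feeds two made-up lines to grep on standard input; the second line and the variable names are hypothetical.

```shell
# One crontab-style line and one unrelated line, supplied on standard input.
entries='55 17 * * 4 find / -name core -mtime +30 -print
0 2 * * * /usr/bin/backup'

# -e marks the next argument as the pattern, even though it starts with a -.
hits=$(printf '%s\n' "$entries" | grep -e "-mtime")

echo "$hits"
```

Only the line containing -mtime is selected; the quotes around the pattern play no part in this, since it is -e that tells grep how to read the argument.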

Some More Options

Matching Multiple Patterns (-e and -f)  As just mentioned, the -e option has
an additional use in Linux. With this option, you can match multiple patterns using
a single invocation of the command:

$ grep -e woodhouse -e wood -e woodcock emp.lst
2365|john woodcock   |director |personnel |05/11/47|120000
5423|barry wood      |chairman |admin     |08/30/56|160000
1265|p.j. woodhouse  |manager  |sales     |09/12/63| 90000

Quotes aren't really necessary for single-word arguments, so for a change we
dropped them here. However, the tedium of entering such a lengthy command line
is compelling enough to use regular expressions, which we'll discuss shortly.
You can put all the three patterns in a separate file, one pattern per line. GNU
grep takes input from there with the -f option:

grep -f pattern.lst emp.lst

The -f option is also offered by egrep and fgrep, though it is used in different
ways.

Printing the Neighborhood  GNU grep has a nifty option that locates not only
the matching line, but also a certain number of lines above and below it. For instance,
you may want to know what went before and after the foreach statement that you
used in a perl script:

$ grep -1 "foreach" count.pl                     One line above and below
print ("Region List\n") ;
foreach $r_code sort (keys(%regionlist)) {
   print ("$r_code : $region{$r_code} : $regionlist{$r_code}\n")

The command locates the string foreach and displays one line on either side of it.
Isn't this feature useful? Using this numeric option, you can locate a section of code
by supplying a unique string that exists in the middle of the code segment.
You can be selective and prefer to display, along with the matched line, a
certain number of lines either above or below. This requires the -A and -B options:

grep -A 5 "do loop" update.sql                   5 lines after matching line
grep -B 3 "do loop" update.sql                   3 lines before matching line

It's easier to identify the context of a matched line when the immediate
neighborhood is also presented. These options are also useful when searching sorted files.
The previous user-id or the next date of birth are often important things to look for.

15.4 Regular Expressions: Round One

View the file emp.lst (15.1) once again, and you'll find some names spelled in a
similar manner, like trueman and truman, wilcocks and wilcox. You'll often want to
locate both truman and trueman without using grep twice. The command

$ grep truman emp.lst
0110|julie truman    |g.m.     |marketing |12/31/40| 95000

doesn't help here as it lists only one line, the one exactly matching the pattern. Like
the shell's wild cards (8.2) which match similar filenames with a single expression,
grep also uses an expression of a different sort to match a group of similar patterns.
Unlike wild cards, however, this expression is a feature of the command that uses it and
has nothing to do with the shell. It has an elaborate metacharacter set (Table 15.2)
overshadowing the shell's wild cards and can perform amazing matches. If an expression
uses any of these characters, it is termed a regular expression.
Regular expressions take care of some common query and substitution
requirements. You may want the system to present a list of similar names, so you can select
exactly the one you require. Or you may want to replace multiple spaces with a single
space or display lines that begin with a #. You may even be looking for a string at a
specific column position in a line. All this is possible (and much more) with regular
expressions as you'll discover in the three rounds of discussions that feature the
subject in this chapter.
Some of the characters used by regular expressions are also meaningful to the
shell, enough reason why these expressions should be quoted. We'll first start with a
minimal treatment of regular expressions and then expand the coverage when we
discuss sed.

Note: Regular expressions are interpreted by the command and not by the shell. Quoting
ensures that the shell isn't able to interfere and interpret the metacharacters in its own way.

TABLE 15.2 The Regular Expression Characters Used by grep, sed and perl

Symbols         Matches
*               Zero or more occurrences of previous character
g*              Nothing or g, gg, ggg, etc.
gg*             g, gg, ggg, etc.
.               A single character
.*              Nothing or any number of characters
[pqr]           A single character p, q or r
[abc]           a, b or c
[c1-c2]         A single character within the ASCII range represented by c1 and c2
[1-3]           A digit between 1 and 3
[^pqr]          A single character which is not a p, q or r
[^a-zA-Z]       A nonalphabetic character
^pat            Pattern pat at beginning of line
pat$            Pattern pat at end of line
bash$           bash at end of line
^bash$          bash as the only word in line
^$              Lines containing nothing
\{m\}           m occurrences of the previous character (no \ in perl) (15.12.1)
^.\{9\}nobody   nobody after skipping nine characters from line beginning (no \ in perl)
                (15.12.1)
\{m,\}          At least m occurrences of the previous character (no \ in perl) (15.12.1)
\{m,n\}         Between m and n occurrences of the previous character (no \ in perl)
                (15.12.1)
\(exp\)         Expression exp for later referencing with \1, \2, etc. (no \ before
                ( and ) in perl) (15.12.2)
\(BOLD\).*\1    At least two occurrences of the string BOLD in a line (no \ before
                ( and ) in perl) (15.12.2)

15.4.1 The Character Class

Like the shell's wild cards, a regular expression also uses a character class that encloses
a group of characters within a pair of rectangular brackets [ ]. The match is then
performed for a single character in the group. Thus, the expression

[od]                                             Either o or d

matches either an o or d. You can also use ranges, both for alphabets and numerals.
Thus, the pattern

[a-zA-Z0-9]

matches a single alphanumeric character. This property can now be used to match
woodhouse and wodehouse. These two patterns differ in their third and fourth
character positions: od in one and de in the other. To match these two strings, we'll have to
use the model [od][de] which in fact matches all these four patterns:

od oe dd de

The first and fourth are relevant to the present problem. Using the character class, the
regular expression required to match woodhouse and wodehouse should be this:

wo[od][de]house

Let's use this regular expression with grep:

$ grep "wo[od][de]house" emp.lst
1265|p.j. woodhouse  |manager  |sales     |09/12/63| 90000
2476|jackie wodehouse|manager  |sales     |05/01/59|110000

A single pattern has located two similar strings; that's what regular expressions are all
about.
When ranges are used, the character on the left side of the - must be lower (in
the ASCII collating sequence) than the one on the right. The character class [X-c] is,
therefore, quite legitimate, as X has a lower ASCII value than c. However, that doesn't
mean you can match an alphabetic character in either case with the expression [A-z]
because, between Z and a, there are a number of other nonalphabetic characters as well
(the caret, for example).
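
The pitfall with [A-z] can be checked with a three-line input. This sketch sets LC_ALL=C so that ranges follow plain ASCII order, which is an assumption about the environment; in other locales the ordering inside a range may differ.

```shell
# Input: a caret, a digit and a letter, one per line.
# LC_ALL=C forces ASCII ordering for the ranges.
loose=$(printf '%s\n' '^' '7' 'q' | LC_ALL=C grep -c '[A-z]' || true)
strict=$(printf '%s\n' '^' '7' 'q' | LC_ALL=C grep -c '[a-zA-Z]' || true)

echo "[A-z] matched $loose lines, [a-zA-Z] matched $strict"
```

The wider range picks up the caret along with the letter, while the two explicit ranges select the letter alone.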

Negating a Class  Regular expressions use the ^ (caret) to negate the character class,
while the shell uses the ! (bang). When the character class begins with this character,
all characters other than the ones grouped in the class are matched. So a single
nonalphabetic character string is represented by this expression:

[^a-zA-Z]

Note: The feature of the character class is similar to the wild cards except that negation of
the class is done by a ^ (caret), while in the shell it's done by the ! (bang).

15.4.2 The *

The * (asterisk) refers to the immediately preceding character. However, its
interpretation is the trickiest of the lot. Keep in mind that it bears absolutely no resemblance to
the * used by wild cards or DOS. Rather, it matches zero or more occurrences of the
previous character. In other words, the previous character can occur many times, or not
at all. The pattern

e*

matches the single character e and any number of es. Because the previous character
may not occur at all, it also matches a null string. Thus, apart from this null string, it
also matches the following strings:

e ee eee eeee .....

Mark the words "zero or more occurrences of the previous character" that are used to
describe the significance of the *. Don't make the mistake of using this expression to
match a string beginning with e; use ee* instead. Recall that the * used by wild cards
doesn't relate to the previous character at all.

Note: The expression e* indicates that e might not occur at all!

How do you now match trueman and truman? The first pattern contains an e,
while the other pattern doesn't. This means that e may or may not occur at all in the
expression, and the regular expression that signifies that is e*. This means that

true*man

matches the two patterns. Now use this expression with grep and you would have done
the job:

$ grep "true*man" emp.lst
3564|ronie trueman   |executive|personnel |07/06/47| 75000
0110|julie truman    |g.m.     |marketing |12/31/40| 95000

A simple regular expression using the unusual significance of the * matches both
names! But note that these are not the only strings it can match; the expression is
general enough to include other patterns. It would have also matched trueeman had there
been such a pattern in the file.
Using both the character class and the *, we can now match wilcocks and
wilcox:

$ grep "wilco[cx]k*s*" emp.lst
3212|bill wilcocks   |d.g.m.   |accounts  |12/12/55| 85000
2345|james wilcox    |g.m.     |marketing |03/12/45|110000

The expression k*s* means that k and s may not occur at all (or as many times as
possible); that's why the expression used with grep also matches wilcox which doesn't
contain these two characters at its end. You can feel the power of regular expressions
here, and how they easily exceed the capabilities of wild cards.

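Note: The * in its special sense always refers to the character preceding it, and has
significance in a regular expression only if it is preceded by a character. If it's the first
character in a regular expression, then it's treated literally.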
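
To see how wide the net of e* really is, here is a sketch run against four made-up names on standard input (trueeman and trman don't exist in emp.lst; they are assumptions for the test).

```shell
# Four candidate names, one per line.
names='trueman
truman
trueeman
trman'

# e* allows zero or more e's after tru, so three of the four names match.
zero_or_more=$(printf '%s\n' "$names" | grep -c 'true*man')

# ee* insists on at least one e.
one_or_more=$(printf '%s\n' "$names" | grep -c 'truee*man')

echo "$zero_or_more $one_or_more"
```

The first count includes truman; the second drops it because ee* needs at least one e. Neither expression matches trman, since the * applies only to the e and not to the u before it.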

15.4.3 The Dot

A . matches a single character. The shell uses the ? character to indicate that. The
pattern

2...

matches a four-character pattern beginning with a 2. The shell's equivalent pattern is

2???

The Regular Expression .*  The dot along with the * (.*) constitutes a very useful
regular expression. It signifies any number of characters, or none. Say, for instance, you
are looking for the name p. woodhouse, but are not sure whether it actually exists in
the file as p.j. woodhouse. No problem, just embed the .* in the search string:

$ grep "p.*woodhouse" emp.lst
1265|p.j. woodhouse  |manager  |sales     |09/12/63| 90000

Note that if you literally look for the name p.j. woodhouse, then the expression
should be p\.j\. woodhouse. The dots need to be escaped here with the \, the same
character you used in the shell for despecializing the next character.

Note: What the ? means to the shell, the . (dot) means to a regular expression.

15.4.4 The ^ and $: Specifying Pattern Locations

A regular expression possesses one more property in that it can match a pattern at the
beginning or end of a line. These are the two characters that are used:

^ (caret) - for matching at the beginning.
$ - for matching at the end.

Anchoring a pattern in this way is often necessary when it can occur in more than one
place in a line and you are interested in its occurrence only at a particular location.
Consider a simple example. Try to extract those lines where the empid begins
with a 2. What happens if you simply use

2...

as the expression? This won't do because the character 2, followed by three characters,
can occur anywhere in the line. You must indicate to grep that the pattern occurs at the
beginning of the line, and the ^ does it easily:

$ grep "^2" emp.lst
2233|charles harris  |g.m.     |sales     |12/12/52| 90000
2365|john woodcock   |director |personnel |05/11/47|120000
2476|jackie wodehouse|manager  |sales     |05/01/59|110000
2345|james wilcox    |g.m.     |marketing |03/12/45|110000

Similarly, to select those lines where the salary lies between 70,000 and 89,999 dollars,
you have to use the $ (nothing to do with the currency) at the end of the pattern:

$ grep "[78]....$" emp.lst
5678|robert dylan    |d.g.m.   |marketing |04/19/43| 85000
3212|bill wilcocks   |d.g.m.   |accounts  |12/12/55| 85000
3564|ronie trueman   |executive|personnel |07/06/47| 75000

How can you reverse the search and select only those lines where the empids don't
begin with a 2? You need the expression ^[^2] and the following command should do
the job:

grep "^[^2]" emp.lst
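
The three anchored expressions can be compared side by side on a four-record excerpt. The variable names in this sketch are made up for the illustration.

```shell
# Four records in the emp.lst layout.
emp='2233|charles harris  |g.m.     |sales     |12/12/52| 90000
5678|robert dylan    |d.g.m.   |marketing |04/19/43| 85000
3564|ronie trueman   |executive|personnel |07/06/47| 75000
1265|p.j. woodhouse  |manager  |sales     |09/12/63| 90000'

# Empid beginning with 2: the ^ pins the match to the start of the line.
begins_with_2=$(printf '%s\n' "$emp" | grep -c '^2')

# Salary from 70000 to 89999: a 7 or 8, then four characters, then end of line.
salary_7x_8x=$(printf '%s\n' "$emp" | grep -c '[78]....$')

# Empid not beginning with 2: anchor at the start, then negate the class.
not_2=$(printf '%s\n' "$emp" | grep -c '^[^2]')

echo "$begins_with_2 $salary_7x_8x $not_2"
```

Notice that the two carets in the last expression play different roles: the one outside the brackets anchors, the one inside negates.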
UNIX has no command that lists only directories. However, we can use a pipeline to
"grep" those lines from the listing that begin with a d:

ls -l | grep "^d"                                Shows only the directories

It's indeed strange that in an operating system known for its commitment to brevity and
options, you have to type such a long sequence simply to list the directories! You
should convert this into an alias (17.4) or a shell function (19.10) so that it is always
available for you to use.
Here's how grep can add power to the ls -l command. This pipeline locates all
files which have write permission for the group:

$ ls -l | grep '^.....w'                         Locates w in sixth position
drwxrw-r-x   3 sumit    dialout      1024 Oct 31 15:16 text
-rwxrw----   1 henry    dialout     22954 Nov  7 08:21 wall.gif
-rw-rw-r--   1 henry    dialout       717 Oct 25 09:36 wall.html

This sequence matches a w at the sixth column location of the ls -l output, the one
which indicates the presence or absence of write permission for the group.

Note: The caret has a triple role to play in regular expressions. When placed at the beginning
of a character class (e.g., [^a-z]), it negates every character of the class. When placed outside
it, and at the beginning of the expression (e.g., ^2...), the pattern is matched at the
beginning of the line. At any other location (e.g., a^b), it matches itself literally.

15.4.5 When Metacharacters Lose Their Meaning

It's possible that some of these special characters actually exist as part of the text. If a
literal match has to be made for any of them, the "magic" of the characters should be
turned off. Sometimes, that is automatically done if the characters violate the regular
expression rules. Like the caret, the meaning of these characters can change depending
on the place they occupy in the expression.
The - loses its meaning inside the character class if it's not enclosed on either side
by a suitable character, or when placed outside the class. The . and * lose their
meanings when placed inside the character class. The * is also matched literally if it's the first
character of the expression. For instance, when you use grep "*", you are in fact
looking for an asterisk.
Sometimes, you may need to escape these characters, say, when looking for a
pattern g*. In that case, grep "g*" won't do, and you have to use the \ for escaping.
Similarly, to look for a [, you should use \[, and to look for the literal pattern .*, you
should use \.\*.
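
These rules are easy to verify on a handful of lines. The input below is invented for the sketch, as are the variable names.

```shell
# Five lines, some containing characters that are special to grep.
text='g*
ggg
a.*b
axxb
[x]'

# Unescaped, g* means "zero or more g", which every line satisfies.
unescaped=$(printf '%s\n' "$text" | grep -c 'g*')

# Escaped, the * is an ordinary asterisk.
star=$(printf '%s\n' "$text" | grep 'g\*')

# A literal dot followed by a literal asterisk.
dotstar=$(printf '%s\n' "$text" | grep '\.\*')

# A literal [ needs escaping too.
bracket=$(printf '%s\n' "$text" | grep '\[')

# A * that begins the expression has no previous character and matches itself.
leading=$(printf '%s\n' "$text" | grep -c '*')

echo "$unescaped | $star | $dotstar | $bracket | $leading"
```

Single quotes are used throughout so that the shell passes the backslashes to grep untouched.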
Regular expressions are found everywhere in the UNIX system. You have already
used them with vi and emacs, and now with grep. Apart from them, some of the most
powerful UNIX commands like egrep, sed, awk, perl and expr also use regular
expressions. You must understand them because they hold the key to the mastery of the
UNIX system.
We'll introduce some more metacharacters used by regular expressions later in
this chapter and when we take up perl. To understand some of them, you need to know
the egrep command first.

Note: You should always keep in mind that a regular expression tries to match a string nearest
to the beginning of the line. The match is also made for the longest possible string. Thus, when
you use the expression 03.*05, it will match 03 and 05 as close to the left and right of the
line, respectively. This point acquires significance when these expressions are used for
substitution.
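
A substitution makes the extent of the match visible. The sketch below borrows sed's s command, which belongs to a later part of the chapter, and runs it on a made-up line that contains 03 and 05 twice each.

```shell
# Both 03 and 05 occur twice in this line.
line='a 03 b 05 c 03 d 05 e'

# The match starts at the first 03 and stretches to the last 05.
result=$(printf '%s\n' "$line" | sed 's/03.*05/X/')

echo "$result"     # prints: a X e
```

Everything between the leftmost 03 and the rightmost 05 is swallowed by the .* and replaced in one go, not just the text up to the first 05.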
15.5 egrep and fgrep: The Other Members

The egrep and fgrep commands extend grep's pattern-matching capabilities. They
both use most of grep's options, but have some special features of their own. They
search for multiple patterns and also take them from a file.
How do you now locate both woodhouse and woodcock from the file, a thing that
GNU grep achieves by using multiple -e options? This is easily done with egrep.
Delimit the two expressions with the | and the job is done:

$ egrep 'woodhouse|woodcock' emp.lst
2365|john woodcock   |director |personnel |05/11/47|120000
1265|p.j. woodhouse  |manager  |sales     |09/12/63| 90000

The | is a regular expression character used by egrep; we'll discuss the characters that
are special to egrep in the next section. With fgrep, you would have to place each
pattern on a separate line by itself:

fgrep 'woodhouse
woodcock' emp.lst

C shell users should escape the newline character by using a \ at the end of the first
line. fgrep doesn't use any regular expression character, including the | used by
egrep. If the pattern to search for is a simple string or a group of them, fgrep is
recommended. It is arguably faster too, the reason why it's known as fast grep.
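
The contrast between the two commands shows up clearly when the data itself contains a |. This sketch uses the -E and -F spellings from Table 15.1 instead of the egrep and fgrep names, on the assumption that your grep supports them (some recent systems print a warning for the older names). The three input lines are invented.

```shell
# The first line contains a literal | between the two names.
text='woodhouse|woodcock
p.j. woodhouse
john woodcock'

# -F (fgrep-style): the | is an ordinary character, so only the first line matches.
fixed=$(printf '%s\n' "$text" | grep -F -c 'woodhouse|woodcock')

# -E (egrep-style): the | separates two alternative patterns.
extended=$(printf '%s\n' "$text" | grep -E -c 'woodhouse|woodcock')

# -F with one pattern per line also treats them as alternatives.
two_fixed=$(printf '%s\n' "$text" | grep -F -c 'woodhouse
woodcock')

echo "$fixed $extended $two_fixed"
```

The newline inside the last pattern is what fgrep uses in place of the |, which is why each pattern has to sit on a line by itself.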

15.5.1 Storing Patterns in a File: (-f)


So far you have been scanning the file for two or three patterns at the most. What do you do if there are quite a number of them? Both commands support the -f (file) option to take such patterns from a file. The patterns should be stored in exactly the same way you use them in the command line. Here's how you'll fill up a file for use by egrep:

$ cat pat.lst
admin|accounts|sales

And here's how the file should look if fgrep were to use it:

$ cat pat.lst
admin
accounts
sales

To look for these three patterns, the egrep and fgrep commands should now be used in these ways:

egrep -f pat.lst emp.lst
fgrep -f pat.lst emp.lst
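
As a rough check of the two file layouts, the following sketch builds both pattern files and compares the results. The file names pat_ere.lst and pat_fixed.lst and the sample lines are made up for the illustration, and grep -E and grep -F stand in for egrep and fgrep.

```shell
dir=$(mktemp -d "${TMPDIR:-/tmp}/patXXXXXX")

# Hypothetical extract of the employee file.
cat > "$dir/emp.lst" <<'EOF'
2233|charles harris|g.m.|sales
9876|bill johnson|director|production
3212|bill wilcocks|d.g.m.|accounts
EOF

# One line of alternatives for the extended-expression search ...
printf '%s\n' 'admin|accounts|sales' > "$dir/pat_ere.lst"
# ... and one plain string per line for the fixed-string search.
printf '%s\n' admin accounts sales > "$dir/pat_fixed.lst"

ere_hits=$(grep -E -f "$dir/pat_ere.lst" "$dir/emp.lst")
fixed_hits=$(grep -F -f "$dir/pat_fixed.lst" "$dir/emp.lst")

printf '%s\n' "$fixed_hits"
rm -rf "$dir"
```

Both searches pick the sales and accounts lines and skip the production line.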

The principal disadvantage with the commands of the grep family is that none of them has separate facilities to identify fields, and it's not easy to search for an expression in a field. This is where awk and perl score over them.
egrep accepts all the regular expression characters discussed previously (though not necessarily all the ones grep actually supports), but uses some special characters too. This calls for the second round of discussions on regular expressions.

15.6 Regular Expressions-Round Two


The regular expressions we have used so far in grep can also be used as patterns in egrep (and also in awk). While grep and sed use some more characters not recognized by egrep, egrep's set includes some additional characters (Table 15.3) not used either by grep or sed.

15.6.1 The + and ?

egrep's extended set includes two special characters, + and ?. They are often used in place of the * to restrict the matching scope. They signify the following:

+    Matches one or more occurrences of the previous character.
?    Matches zero or one occurrence of the previous character.

What all this means is that b+ matches b, bb, bbb, etc.; unlike b*, it doesn't match nothing. The expression b? matches either a single instance of b or nothing. These characters restrict the scope of match as compared to the *.
In the two truemans that exist in emp.lst (15.1), note that the character e either occurs once or not at all. So, e? is the expression to use here:

$ egrep "true?man" emp.lst
3564|ronie trueman |executive |personnel |01/06/47| 75000
0110|julie truman |g.m. |marketing |12/31/40| 95000
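
To see how +, ? and * differ in scope, you can run the three expressions against the same input. This is a small sketch; the third name (trueeman) is invented purely to separate the three cases, and grep -E is used in place of egrep.

```shell
# Two names from the text plus one invented name with a doubled e.
names=$(printf '%s\n' 'ronie trueman' 'julie truman' 'harry trueeman')

# e? : zero or one e  -> trueman and truman
optional_e=$(printf '%s\n' "$names" | grep -E 'true?man')

# e+ : one or more e  -> trueman and trueeman
one_or_more=$(printf '%s\n' "$names" | grep -E 'true+man')

# e* : zero or more e -> all three names
zero_or_more=$(printf '%s\n' "$names" | grep -E 'true*man')

printf '%s\n' "$optional_e"
```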

TABLE 15.3 The Extended Regular Expression Set Used by egrep and awk

Expression        Matches
ch+               One or more occurrences of character ch
g+                At least one g
ch?               Zero or one occurrence of character ch
g?                Nothing or one g
exp1|exp2         Expression exp1 or exp2
GIF|JPEG          GIF or JPEG
(x1|x2)x3         Expression x1x3 or x2x3
(lock|ver)wood    lockwood or verwood


The + is a pretty useful character too. When you are looking for two consecutive words but don't know how many spaces separate the two, you can use the + to represent at least one space. Thus the expression a +b (with a space before the +) matches a and b separated by one, two, three or more spaces.

15.6.2 The |, ( and ): Searching For Multiple Patterns

In the previous section, we used the | to match multiple patterns in this way:

egrep 'woodhouse|woodcock' emp.lst

The | is yet another character used by egrep's regular expression set. There are more; using the parentheses, egrep offers an even better alternative. You can use parentheses to group patterns and the pipe to act as delimiter:

$ egrep 'wood(house|cock)' emp.lst
2365|john woodcock |director |personnel |05/11/47|120000
1265|p.j. woodhouse |manager |sales |09/12/63| 90000

You can now combine the other regular expression characters that we used in grep to form a rather complex sequence:

$ egrep 'wilco[cx]k*s*|wood(house|cock)' emp.lst
2365|john woodcock |director |personnel |05/11/47|120000
1265|p.j. woodhouse |manager |sales |09/12/63| 90000
3212|bill wilcocks |d.g.m. |accounts |12/12/55| 85000
2345|james wilcox |g.m. |marketing |03/12/45|110000

Even though it appears that egrep supports the regular expressions used by grep, that's not true. There are some characters and examples listed in Table 15.2 which are not supported by egrep. We'll discuss them when we take up sed. Our coverage of regular expressions is not over yet.

Linux: You can also use grep -E to use egrep's extended regular expressions. The -F option makes grep behave like fgrep.
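
Grouping can be tried out with grep -E in the same way. The sketch below is illustrative only; the name list is a shortened stand-in for emp.lst, and barry wood is included to show a line that neither expression selects.

```shell
names=$(printf '%s\n' 'john woodcock' 'p.j. woodhouse' \
    'bill wilcocks' 'james wilcox' 'barry wood')

# The parentheses factor out the common prefix "wood".
grouped=$(printf '%s\n' "$names" | grep -E 'wood(house|cock)')

# A character class, two * operators, the | and a group in one expression.
combined=$(printf '%s\n' "$names" | grep -E 'wilco[cx]k*s*|wood(house|cock)')

printf '%s\n' "$combined"
```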

15.7 sed: The Stream Editor

sed is a multipurpose tool which combines the work of several filters. Designed by Lee McMahon, it is derived from the ed line editor, the original editor of UNIX (not discussed in this text). sed is used for performing noninteractive operations. It acts on a data stream, hence its name.
sed has very few options, and its power is derived from the ease with which you can both select lines and frame instructions to act on them. It has numerous features, almost bordering on a programming language. Due to obvious constraints, we'll have to stop short of its limits because its functions have been taken over by perl. In fact, perl often handles them better and faster.
Everything in sed is an instruction. An instruction combines an address for selecting lines with an action to be taken on them:

sed options 'address action' file(s)

The address and action are enclosed within single quotes. The action component is drawn from sed's family of internal commands (Table 15.4). It can either be a simple display (print) or an editing function like insertion, deletion or substitution of text. The components of a sed instruction are shown in Fig. 15.1.
You can have multiple instructions in a single sed command, each with its own address and action components. This is what makes the command so powerful.

FIGURE 15.1 Components of a sed Instruction

sed '1,$s/^bold/BOLD/g' foo

address: 1,$
action:  s/^bold/BOLD/g

TABLE 15.4 Internal Commands Used by sed

Command           Significance
i, a, c           Inserts, appends and changes text
d                 Deletes line(s)
1,4d              Deletes lines 1 to 4
r foo             Places contents of file foo after line
w bar             Writes addressed lines to file bar
p                 Prints line(s) on standard output
3,$p              Prints lines 3 to end (-n option required)
$!p               Prints all lines except last line (-n option required)
/begin/,/end/p    Prints lines enclosed between begin and end (-n option required)
q                 Quits after reading up to addressed line
10q               Quits after reading the first 10 lines
=                 Prints line number addressed
s/s1/s2/          Replaces first occurrence of string or regular expression s1 in all lines with string s2
10,20s/-/:/       Replaces first occurrence of - in lines 10 to 20 with a :
s/s1/s2/g         Replaces all occurrences of string or regular expression s1 in all lines with string s2
s/-/:/g           Replaces all occurrences of - in all lines with a :

Note: Before proceeding further, C shell users must note that when a sed command is continued in the next line by pressing the [Enter] key, the shell generates an error and complains of an "unmatched" quote. As a general rule, escape all lines except the last with a \ to generate the ? prompt. (Some systems like Solaris don't display this prompt.) The situations where such escaping is required are pointed out sometimes, but not always. Sometimes, escaping doesn't work, which means the command in that form simply won't work in this shell.

Note: Chapter 17 emphasizes that the Korn and bash shells are far superior to the C shell. You'll find sed easier to use if you choose either Korn or bash as your login shell, or at least the Bourne shell. If you still don't want to change your working shell, at least use a different shell for running the sed commands and awk programs discussed in the next chapter. Simply execute sh, ksh or bash, whichever is available on your system, and continue working normally. At the end of your session, run exit to return to your original shell.

15.8 Line Addressing



Addressing in sed is done in two ways:


• By line number (like 3,7p).
• By specifying a pattern which occurs in a line (like /From:/p).
In the first form, the address specifies either one line number to select a single line or a set of two (3,7) to select a group of contiguous lines. Likewise, the second form uses one or two patterns. In either case, the action (p, the print command) is appended to this address. We'll consider line addressing first.
Let's first consider the instruction 3q, which uses a line address. This can be broken down to the address 3 and the action q (quit). When this instruction is enclosed within quotes and followed by one or more filenames, you can simulate head -3 in this way:
$ sed '3q' emp.lst                                        Quits after line number 3
2233|charles harris |g.m. |sales |12/12/52| 90000
9876|bill johnson |director |production|03/12/50|130000
5678|robert dylan |d.g.m. |marketing |04/19/43| 85000

sed also uses the p (print) command to print the output. But notice what happens when you use two line addresses with the p command:

$ sed '1,2p' emp.lst
2233|charles harris |g.m. |sales |12/12/52| 90000
2233|charles harris |g.m. |sales |12/12/52| 90000
9876|bill johnson |director |production|03/12/50|130000
9876|bill johnson |director |production|03/12/50|130000
5678|robert dylan |d.g.m. |marketing |04/19/43| 85000
2365|john woodcock |director |personnel |05/11/47|120000
...more lines with each line displayed only once...

By default, sed prints all lines on the standard output in addition to the lines affected by the action. So the addressed lines (the first two) are printed twice. But this is not what you wanted, and you need to use an option with sed to suppress this printing. The solution is discussed next.

Suppressing Duplicate Line Printing (-n)  To overcome the problem of printing duplicate lines, you should use the -n option whenever you use the p command. Thus, the previous command should have been written as follows:

$ sed -n '1,2p' emp.lst
2233|charles harris |g.m. |sales |12/12/52| 90000
9876|bill johnson |director |production|03/12/50|130000

And, to select the last line of the file, use the $:

$ sed -n '$p' emp.lst
0110|julie truman |g.m. |marketing |12/31/40| 95000
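
The effect of -n is easiest to see on a tiny file. The following sketch uses a made-up four-line file rather than emp.lst and captures each result in a shell variable so the outputs can be compared.

```shell
f=$(mktemp "${TMPDIR:-/tmp}/sedXXXXXX")
printf '%s\n' one two three four > "$f"

head3=$(sed '3q' "$f")           # quits after line 3, like head -3
with_dups=$(sed '1,2p' "$f")     # lines 1 and 2 appear twice
selected=$(sed -n '1,2p' "$f")   # only lines 1 and 2
last=$(sed -n '$p' "$f")         # only the last line

printf '%s\n' "$with_dups"
rm -f "$f"
```

Without -n the command prints six lines for a four-line file; with -n only the addressed lines survive.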

The address and action are normally enclosed within a pair of single quotes. As you have already learned by now, you should use double quotes only when parameter evaluation or command substitution is embedded in a sed instruction.

Reversing Line Selection Criteria (!)  You can use sed's negation operator (!) with any action. So selecting the first two lines is the same as not selecting lines 3 through the end. The command sequence prior to the previous one can be written in this way too:

sed -n '3,$!p' emp.lst                  Don't print lines 3 to the end

Selecting Lines from the Middle  sed can also select lines from the middle of a file, something that's not possible with either head or tail (acting alone):

sed -n '9,11p' emp.lst                  Lines 9 to 11

Selecting Multiple Sections  sed is not restricted to selecting contiguous groups of lines. By placing each instruction on a separate line, you can select as many sections from just about anywhere:

sed -n '1,2p                            3 addresses in one command,
7,9p                                    using only a single pair of quotes
$p' emp.lst                             Selects the last line

You can place all these instructions in a single line too, but each instruction has to be preceded by the -e option:

sed -n -e '1,2p' -e '7,9p' -e '$p' emp.lst          Same as above


Tip: Use the -n option whenever you use the p command, unless you deliberately want to select lines twice. Very rarely will you need to print every line twice.
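
The line-addressing forms of this section can be compared side by side on a file whose contents are its own line numbers. This is a minimal sketch with a made-up ten-line file; the address 4,6 is chosen only because the file is shorter than emp.lst.

```shell
f=$(mktemp "${TMPDIR:-/tmp}/sedXXXXXX")
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 > "$f"

negated=$(sed -n '3,$!p' "$f")        # everything except lines 3 to end
middle=$(sed -n '4,6p' "$f")          # a block from the middle

# Three sections, first with one instruction per line ...
multiline=$(sed -n '1,2p
7,9p
$p' "$f")
# ... and then with one -e option per instruction.
sections=$(sed -n -e '1,2p' -e '7,9p' -e '$p' "$f")

printf '%s\n' "$sections"
rm -f "$f"
```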

15.9 Context Addressing


The second form of addressing lets you specify a pattern (or two) rather than line numbers. This is known as context addressing, where the pattern has a / on either side. You can locate the senders from your mailbox ($HOME/mbox) in this way:

$ sed -n '/From: /p' $HOME/mbox
From: janis joplin <joplinj@altavista.net>
From: charles king <charlesk@rocketmail.com>
From: Monica Johnson <Monicaj@Web6000.com>
From: The Economist <business@lists.economist.com>

Both awk and perl also support this form of addressing. Ideally, you should be looking for From: at the beginning of a line. sed also accepts regular expressions of the type we used with grep. The following command lines should refresh your memory:

sed -n '/^From: /p' $HOME/mbox          ^ matches at beginning of line
sed -n '/wilco[cx]k*s*/p' emp.lst       Both wilcox and wilcocks
sed -n "/o'br[iy][ae]n/p                Either the o'briens or
/lennon/p" emp.lst                      lennon

Note that we had to use double quotes in the third example because the pattern itself contains a single quote. Double quotes protect single quotes in just the same way single quotes protect double.

Note: C shell users should note that you must add a \ at the end of the first line in the third example above. Otherwise, the shell will generate the error message Unmatched ". as it always does whenever it sees a line containing an unclosed double or single quote.

You can also specify a comma-separated pair of context addresses to select a group of contiguous lines. What is more, line and context addresses can also be mixed:

sed -n '/johnson/,/lightfoot/p' emp.lst
sed -n '1, /woodcock/p' emp.lst         Space after comma
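
Context addresses, alone or paired, can be exercised on a short list of names. The sketch below is illustrative; the six names are a stand-in for emp.lst, with only the name field kept.

```shell
f=$(mktemp "${TMPDIR:-/tmp}/sedXXXXXX")
cat > "$f" <<'EOF'
charles harris
bill johnson
robert dylan
john woodcock
gordon lightfoot
barry wood
EOF

anchored=$(sed -n '/^bill/p' "$f")                  # one context address
between=$(sed -n '/johnson/,/lightfoot/p' "$f")     # pattern to pattern
from_top=$(sed -n '1,/woodcock/p' "$f")             # line number to pattern

printf '%s\n' "$between"
rm -f "$f"
```

Both ranges happen to span four lines here: the first runs from johnson to lightfoot, the second from line 1 to woodcock.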

In a previous example (15.4.4), we used ls and grep in a pipeline to list files which have write permission for the group. We can do that with sed as well:

ls -l | sed -n '/^.....w/p'

Regular expressions in grep and sed are actually more powerful than the ones we have used so far. They use some more special characters, and we'll meet them in the third round of discussions at the end of the chapter.

Note: All context addresses, whether single or double, must be enclosed within a pair of /s. In them, you can use regular expressions of the type understood by grep (but not the egrep variety).

15.10 Editing Text

Apart from selecting lines, sed can also edit text itself. Like vi, sed uses the i (insert), a (append), c (change) and r (read) commands in similar manner. These commands are discussed next.

15.10.1 Inserting and Changing Text (i, a and c)

For appending text, you have to use the a command and then enter as many lines as you want. Each line except the last must have a \ at the end. You can append these two lines to the end of this perl library file in this way:

$ sed '$a\                                              Appending to end of file
> #You must place the following line at the end\
> 1;
> ' www_lib.pl > $$

You can actually key in as many lines as you wish, but you have to precede the [Enter] key in each line except the last with a \. This technique has to be followed when using the i and c commands also. $$, which signifies the shell's PID, is used here to frame a numeric filename. You can use any filename here you want; it's just that you are unlikely to overwrite any existing file if you use $$.
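
In a script (where there is no > continuation prompt) the same technique looks like the sketch below. The one-line library file is hypothetical, and the result is captured in a variable instead of being redirected to a file named after the PID.

```shell
lib=$(mktemp "${TMPDIR:-/tmp}/libXXXXXX")
printf '%s\n' '# library code' > "$lib"

# Append two lines after the last line ($); every text line
# except the final one ends with a backslash.
appended=$(sed '$a\
# You must place the following line at the end\
1;' "$lib")

# Insert one line before line 1 using the same syntax.
inserted=$(sed '1i\
# www_lib.pl' "$lib")

printf '%s\n' "$appended"
rm -f "$lib"
```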

Double-Spacing Text  What is the consequence of not using an address with these commands? The inserted or changed text is then placed after or before every line of the file. The following command:

sed 'i\                                 Inserts before every line
                                        this blank line
' foo

inserts a blank line before each line of the file is printed. This is another way of double-spacing text (9.4.1). The difference between i and a is that i inserts text before the addressed line, while a does the same after the line.

Note: These commands won't work in the C shell in the way described here. You have to use two \s for lines that already have one \, and one \ when there is none. The previous command will work in this way in the C shell:

sed 'i\\                                Two \s here
\                                       and one here
' foo

This is an awkward form of usage and is not intuitive at all. The sed, awk and perl commands should be run in the other shells.

Reading in a File (r)  The r command lets you read in a file at a certain location of the file. This is how you can insert a form's details from an external file template.html after the <FORM> tag:


sed '/<FORM>/r template.html' form_entry.html
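
A quick way to convince yourself of where the text lands is to try r on two tiny files. In this sketch the contents of both files are invented, and the files live in a temporary directory, so the script is double-quoted to let the shell expand $dir.

```shell
dir=$(mktemp -d "${TMPDIR:-/tmp}/formXXXXXX")
printf '%s\n' '<FORM>' '</FORM>' > "$dir/form_entry.html"
printf '%s\n' '<INPUT NAME="qty">' > "$dir/template.html"

# The file is read in after every line that matches <FORM>.
merged=$(sed "/<FORM>/r $dir/template.html" "$dir/form_entry.html")

printf '%s\n' "$merged"
rm -rf "$dir"
```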


15.10.2 Deleting Lines (d)
Using the d (delete) command, sed can emulate grep's -v option to select lines not containing the pattern. Either of these commands removes comment lines from a shell or perl script:

sed '/^#/d' foo > bar
sed -n '/^#/!p' foo > bar               -n option to be used

Deleting Blank Lines  A blank line consists of any number of spaces, tabs or nothing. How do you delete these lines from a file? Frame a pattern which matches zero or more occurrences of a space or tab:

sed '/^[ 	]*$/d' foo                    A space and a tab inside [ ]

You need to press the [Tab] key or [Ctrl-i] inside the character class, immediately after the space. Providing a ^ at the beginning and a $ at the end matches only lines that contain nothing but whitespace. Obviously, this expression also matches those lines that contain nothing.
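
Since a literal tab is hard to see (and easy to lose) in a script, one option is to hold it in a variable. The sketch below assumes that approach, which is not from the text; the variable name tab and the sample input are made up.

```shell
tab=$(printf '\t')

# Deleting comment lines: d and the negated p give the same result.
script=$(printf '#!/bin/sh\n# a comment\necho hi\n')
by_d=$(printf '%s\n' "$script" | sed '/^#/d')
by_p=$(printf '%s\n' "$script" | sed -n '/^#/!p')

# Input with an empty line, a line of spaces and a line holding a tab.
input=$(printf 'first\n\n  \n\t\nsecond\n')

# Double quotes let $tab expand; \$ keeps the end-of-line anchor literal.
cleaned=$(printf '%s\n' "$input" | sed "/^[ $tab]*\$/d")

printf '%s\n' "$cleaned"
```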

15.10.3 Writing to Files (w)

The w (write) command writes the selected lines to a separate file. You can save the lines contained within the <FORM> and </FORM> tags in a separate file:

sed '/<FORM>/,/<\/FORM>/w forms.html' pricelist.html

Every <FORM> tag in an HTML file has a corresponding </FORM> tag. The / here needs escaping as / is sed's pattern delimiter. Here, the form contents are extracted and saved in forms.html. To go further, you can save all form segments from all HTML files in a single file:

sed '/<FORM>/,/<\/FORM>/w forms.html' *.html

sed's power doesn't stop here. Since it accepts more than one address, you can perform a full context splitting of its input. You can search for three patterns and store the matched lines in three separate files, all in one shot:

sed '/<FORM>/,/<\/FORM>/w forms.html
/<FRAME>/,/<\/FRAME>/w frames.html
/<TABLE>/,/<\/TABLE>/w tables.html' pricelist.html

Note: The w command outputs all lines on the terminal irrespective of the lines actually written to separate files. If you prefer silent behavior, then use the -n option.
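
Context splitting, and the effect of -n on it, can be observed with a small HTML file. This is a sketch under obvious assumptions: page.html and its contents are invented, and the output files are written to a temporary directory, which is why the script is double-quoted.

```shell
dir=$(mktemp -d "${TMPDIR:-/tmp}/splitXXXXXX")
cat > "$dir/page.html" <<'EOF'
<HTML>
<FORM>
<INPUT NAME="qty">
</FORM>
<TABLE>
<TR><TD>price</TD></TR>
</TABLE>
</HTML>
EOF

# With -n nothing reaches standard output; the lines go only to the files.
shown=$(sed -n "/<FORM>/,/<\/FORM>/w $dir/forms.html
/<TABLE>/,/<\/TABLE>/w $dir/tables.html" "$dir/page.html")

# Without -n every input line is also echoed.
echoed=$(sed "/<FORM>/,/<\/FORM>/w $dir/forms2.html" "$dir/page.html" | wc -l | tr -d ' ')

form_lines=$(wc -l < "$dir/forms.html" | tr -d ' ')
table_lines=$(wc -l < "$dir/tables.html" | tr -d ' ')
echo "forms: $form_lines lines, tables: $table_lines lines"
rm -rf "$dir"
```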

Tip: When there are numerous editing instructions to perform, use the -f option to accept instructions from a file. For the example above, you can use sed -f instr.fil pricelist.html, where instr.fil contains the instructions in this format:

/<FORM>/,/<\/FORM>/w forms.html
/<FRAME>/,/<\/FRAME>/w frames.html
/<TABLE>/,/<\/TABLE>/w tables.html

You can specify some more instructions with the -e option in the command line and let sed take the rest from the file.


15.11 Substitution

sed's strongest feature is undoubtedly substitution, achieved with its s (substitute) command. It lets you replace a pattern in its input with something else. You have encountered the syntax in vi before (4.16):

[address]s/expression1/string2/flag

Here, expression1 (which can also be a regular expression) is replaced by string2 in all lines specified by the [address]. Unlike in vi, however, if the address is not specified, the substitution will be performed for all lines containing expression1. This is how you replace the | with a colon:

$ sed 's/|/:/' emp.lst | head -2
2233:charles harris |g.m. |sales |12/12/52| 90000
9876:bill johnson |director |production|03/12/50|130000

But notice what happened. Just the first (left-most) instance of the | in a line has been replaced. You need to use the g (global) flag to replace all the pipes:

$ sed 's/|/:/g' emp.lst | head -2
2233:charles harris :g.m. :sales :12/12/52: 90000
9876:bill johnson :director :production:03/12/50:130000

We used global substitution to replace all pipes with colons. Though we are seeing two lines here, the substitution has been carried out for the entire file.
You can limit the vertical boundaries too by specifying an address:

sed '1,3s/|/:/g' emp.lst                First three lines only

Substitution is not restricted to a single character; it can be any string. The string to be replaced can even be a regular expression:

sed 's/<I>/<EM>/g' foo.html             1,$s implied here
sed -n 's/gilmo[ur][re]/gilmour/p' emp.lst

Note the use of the -n option in the second example, which not only converts gilmour and gilmore into a single gilmour, but selects just those lines as well.
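
The four variations (first occurrence, g flag, address, and -n with the p flag) can be compared on a three-line sample. The data below is a trimmed, partly invented stand-in for emp.lst; in particular the gilmore line is made up.

```shell
f=$(mktemp "${TMPDIR:-/tmp}/sedXXXXXX")
cat > "$f" <<'EOF'
2233|charles harris|g.m.|sales
9876|bill johnson|director|production
4290|neil gilmore|manager|sales
EOF

first_only=$(sed 's/|/:/' "$f" | head -n 1)      # only the left-most |
all_pipes=$(sed 's/|/:/g' "$f" | head -n 1)      # every | in the line
limited=$(sed '1,2s/|/:/g' "$f" | tail -n 1)     # line 3 is outside 1,2

# -n plus the p flag prints only the lines where a substitution happened.
changed_only=$(sed -n 's/gilmo[ur][re]/gilmour/p' "$f")

printf '%s\n' "$first_only" "$all_pipes" "$limited" "$changed_only"
rm -f "$f"
```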

Checking Whether Substitution Is Performed  sed shows you the contents of the entire file on the screen (unless redirected); it doesn't tell you whether it has been able to perform any substitution at all. Unlike grep, which fails when it can't find a pattern, sed is not considered to fail when it is unable to substitute.
In that case, how does one know whether a substitution has been performed at all? Using our knowledge of the other UNIX filters, we can find out the number of pipes that will be replaced by this sed command:

$ sed 's/|/:/g' emp.lst | cmp -l - emp.lst | wc -l
75

sed's output here is compared with the original file. (cmp's -l option produces a line for each unmatched character.) When wc counts these lines, it effectively tells us that 75 pipes have been replaced. A count of 0 would mean that no substitution has been performed.
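
The counting pipeline can be tried on a file where the answer is known in advance. Note one assumption behind the sketch: cmp -l compares the two inputs byte by byte, so the count equals the number of substitutions only when the replacement has the same length as the text it replaces, as it does here.

```shell
f=$(mktemp "${TMPDIR:-/tmp}/sedXXXXXX")
printf '%s\n' 'a|b|c' 'd|e' > "$f"      # three pipes in all

# tr strips the padding some versions of wc put before the number.
changed=$(sed 's/|/:/g' "$f" | cmp -l - "$f" | wc -l | tr -d ' ')

# A pattern that never occurs leaves the file as it was.
unchanged=$(sed 's/#/:/g' "$f" | cmp -l - "$f" | wc -l | tr -d ' ')

echo "replaced: $changed, replaced by the second command: $unchanged"
rm -f "$f"
```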
Performing Multiple Substitutions  You can perform multiple substitutions with one invocation of sed. Simply press [Enter] at the end of each instruction, and then close the quote at the end:

$ sed 's/<I>/<EM>/g                     For csh add a \
> s/<B>/<STRONG>/g                      for every line
> s/<U>/<EM>/g' form.html               except last

sed is a stream editor; it works on a data stream. This means that an instruction processes the output of the previous one. This is something users often forget; they don't get the sequence right. Note that the following sequence finally converts all <I> tags to <STRONG>:

$ sed 's/<I>/<EM>/g
> s/<EM>/<STRONG>/g' form.html

When there are a group of instructions to execute, you should place these instructions in a file instead, and then use sed with the -f option.
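
The importance of sequence shows up clearly when the same two instructions are run in both orders. The input line in this sketch is invented for the purpose.

```shell
html='<I>note</I> <B>bold</B>'

# <I> becomes <EM>, and the next instruction then turns that <EM> into <STRONG>.
chained=$(printf '%s\n' "$html" | sed 's/<I>/<EM>/g
s/<EM>/<STRONG>/g')

# Reversed order: there is no <EM> yet when the first instruction runs.
reordered=$(printf '%s\n' "$html" | sed 's/<EM>/<STRONG>/g
s/<I>/<EM>/g')

printf '%s\n' "$chained" "$reordered"
```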

Note: When a g is used at the end of a substitution instruction, the change is performed globally along the line. Without it, only the left-most occurrence is substituted.

Compressing Multiple Spaces  How do you delete the trailing spaces from the second, third and fourth fields of the employee database? The regular expression required in the source string needs to signify zero or more occurrences of a space, followed by a |:

$ sed 's^ *|^|^g' emp.lst | head -2                     Space before *
2233|charles harris|g.m.|sales|12/12/52| 90000
9876|bill johnson|director|production|03/12/50|130000

We've used the ^ instead of the / this time. sed (and vi) allows any character to be used as the pattern delimiter as long as it doesn't occur in any of the strings. Most UNIX system files (like /etc/passwd) follow this variable-length format because UNIX tools can easily identify a field by seeing the delimiter. This is the file format you'll be using with the awk command later.
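
A different delimiter pays off most when the strings themselves contain a /. The sketch below contrasts the escaped form with a comma-delimited one and then repeats the space-compression command; the passwd-style line and the padded employee line are made-up samples.

```shell
line='root:x:0:0:root:/root:/bin/sh'

# With / as delimiter every / in the strings must be escaped ...
with_slash=$(printf '%s\n' "$line" | sed 's/\/bin\/sh/\/bin\/ksh/')
# ... while a comma as delimiter leaves the paths readable.
with_comma=$(printf '%s\n' "$line" | sed 's,/bin/sh,/bin/ksh,')

# The ^ as delimiter: zero or more spaces followed by | become a bare |.
trimmed=$(printf '%s\n' '2233|charles harris    |g.m.   |sales' | sed 's^ *|^|^g')

printf '%s\n' "$with_comma" "$trimmed"
```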

Note: The -n option and print (p) command are generally not used when performing substitution. This means that all lines will be displayed, whether a substitution has been performed or not. This is what we normally want sed to do.

15.11.1 The Remembered Pattern

So far, we've looked for a pattern and then replaced it with something else. Truly speaking, the three commands below do the same job:

sed 's/director/member/' emp.lst
sed '/director/s//member/' emp.lst
sed '/director/s/director/member/' emp.lst

The second form suggests that sed "remembers" the scanned pattern, and stores it in // (2 frontslashes). The //, representing an empty (or null) regular expression, is interpreted to mean that the search and substituted patterns are the same. We'll call it the remembered pattern.
However, when you use // in the target string, it means you are removing the pattern totally:

sed 's/|//g' emp.lst                    Remove every | from file

The address /director/ in the third form appears to be redundant. However, you must understand this form also because it widens the scope of substitution. It's possible that you may like to replace a string in all lines containing a different string:

$ sed -n '/marketing/s/director/member/p' emp.lst
6521|derryk o'brien |member |marketing |09/26/45|125000

Note: The significance of // depends on its position in the instruction. If it is in the source string, it implies that the scanned pattern is stored there. If the target string is //, it means that the source pattern is to be removed.
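
The equivalence of the three forms, and the two meanings of //, can be verified on a pair of lines. The sample data in this sketch is abbreviated from emp.lst, with the apostrophe dropped from the second name to keep the quoting simple.

```shell
data=$(printf '%s\n' '9876|bill johnson|director|production' \
    '6521|derryk obrien|director|marketing')

plain=$(printf '%s\n' "$data" | sed 's/director/member/')
remembered=$(printf '%s\n' "$data" | sed '/director/s//member/')
explicit=$(printf '%s\n' "$data" | sed '/director/s/director/member/')

# The address and the source string may differ.
marketing_only=$(printf '%s\n' "$data" | sed -n '/marketing/s/director/member/p')

# An empty target string removes the pattern.
no_pipes=$(printf '%s\n' "$data" | sed 's/|//g' | head -n 1)

printf '%s\n' "$marketing_only" "$no_pipes"
```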

15.11.2 The Repeated Pattern


There are further surprises in store. When a pattern in the source string also occurs in the replaced string, you can use the special character & to represent it. All these commands do the same thing:

sed 's/director/executive director/' emp.lst
sed 's/director/executive &/' emp.lst
sed '/director/s//executive &/' emp.lst

The &, known as the repeated pattern, expands to the entire source string. Apart from the numbered tag (15.12.2), the & is the only other special character you can use in the replacement string. All other characters are treated literally.
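
Here is a small sketch that runs the three forms on one made-up line and confirms that they agree. The last command shows that & expands to whatever the expression actually matched, which matters when the source string is a regular expression.

```shell
line='9876|bill johnson|director|production'

spelled_out=$(printf '%s\n' "$line" | sed 's/director/executive director/')
repeated=$(printf '%s\n' "$line" | sed 's/director/executive &/')
both=$(printf '%s\n' "$line" | sed '/director/s//executive &/')

# & stands for the matched text, here the whole first field.
bracketed=$(printf '%s\n' "$line" | sed 's/^[0-9]*/[&]/')

printf '%s\n' "$repeated" "$bracketed"
```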
