Copyright © 2002 Paul Sheer.


5. Regular Expressions

A regular expression is a sequence of characters that forms a template used to search for strings [Words, phrases, or just about any sequence of characters.] within text. In other words, it is a search pattern. To get an idea of when you would need to do this, consider the example of having a list of names and telephone numbers. If you want to find a telephone number that contains a 3 in the second place and ends with an 8, regular expressions provide a way of doing that kind of search. Or consider the case where you would like to send an email to fifty people, replacing the word after "Dear" with each recipient's name to make the letter more personal. Regular expressions allow for this type of searching and replacing.

5.1 Overview

Many utilities use regular expressions to gain greater power when manipulating text. The grep command is an example. Previously you used the grep command to locate only simple letter sequences in text. Now we will use it to search for regular expressions.

In the previous chapter you learned that the ? character can be used to signify that any character can take its place. This is said to be a wildcard and works with file names. With regular expressions, the wildcard to use is the . character. So, you can use the command grep .3....8 <filename> to find the seven-character telephone number that you are looking for in the above example.
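As a minimal sketch of the search just described, the following builds a small phone list and runs the pattern against it. The names, the numbers, and the temporary file are invented for illustration. The pattern is quoted so that the shell passes it to grep unchanged.

```shell
# Create a hypothetical phone list; the names and numbers are made up.
phones=$(mktemp)
printf '%s\n' 'Ann 2345678' 'Bob 5551234' 'Cy 7390018' 'Dee 8123456' > "$phones"

# Any character, a 3, any four characters, then an 8.
found=$(grep '.3....8' "$phones")
printf '%s\n' "$found"

rm -f "$phones"
```

With this data, only the lines for Ann and Cy are printed. The pattern is not tied to any position in the line, so grep reports every line that contains seven characters of this shape anywhere in it.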

Regular expressions are used for line-by-line searches. For instance, if the seven characters were spread over two lines (i.e., they had a line break in the middle), then grep wouldn't find them. In general, a program that uses regular expressions will consider searches one line at a time.
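The line-by-line behavior can be checked with a short experiment. This sketch feeds grep the same seven characters twice, once on a single line and once broken over two lines, and uses grep's -c option to print the number of matching lines. The sample digits are made up.

```shell
# The same seven characters, once on one line and once split over two.
# grep exits with a nonzero status when nothing matches, hence "|| true".
one_line=$(printf '2345678\n' | grep -c '.3....8' || true)
two_lines=$(printf '234\n5678\n' | grep -c '.3....8' || true)

echo "one line: $one_line, split: $two_lines"
```

The count is 1 for the unbroken number and 0 for the split one.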

Here are some regular expression examples that will teach you the regular expression basics. We use the grep command to show the use of regular expressions (remember that the -w option matches whole words only). Here the expression itself is enclosed in ' quotes for reasons that are explained later.

grep -w 't[a-i]e'
Matches the words tee, the, and tie. The brackets have a special significance. They mean to match one character that can be anything from a to i.
grep -w 't[i-z]e'
Matches the words tie and toe.
grep -w 'cr[a-m]*t'
Matches the words craft, credit, and cricket. The * means to match any number of the previous character, which in this case is any character from a through m.
grep -w 'kr.*n'
Matches the words kremlin and krypton, because the . matches any character and the * means to match the dot any number of times.
egrep -w '(th|sh).*rt'
Matches the words shirt, short, and thwart. The | means to match either the th or the sh. egrep is just like grep but supports extended regular expressions that allow for the | feature. [ The | character often denotes a logical OR, meaning that either the thing on the left or the right of the | is applicable. This is true of many programming languages. ] Note how the square brackets mean one-of-several-characters and the round brackets with |'s mean one-of-several-words.
grep -w 'thr[aeiou]*t'
Matches the words threat and throat. As you can see, a list of possible characters can be placed inside the square brackets.
grep -w 'thr[^a-f]*t'
Matches the words throughput and thrust. The ^ after the first bracket means to match any character except the characters listed. For example, the word thrift is not matched because it contains an f.
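To try the patterns above without a dictionary file, you could run them against a short word list of your own. The sketch below assumes a made-up list and a small helper function called match (a hypothetical name, not a standard command). It uses grep -E, the POSIX spelling of egrep, for the | example.

```shell
# A small made-up word list, one word per line.
words=$(printf '%s\n' tee the tie toe craft credit cricket kremlin krypton \
    shirt short thwart threat throat thrift throughput thrust)

# match [OPTIONS] PATTERN: print the whole words that match, on one line.
match() {
    printf '%s\n' "$words" | grep -w "$@" | paste -s -d ' ' -
}

match 't[a-i]e'
match 't[i-z]e'
match 'cr[a-m]*t'
match 'kr.*n'
match 'thr[aeiou]*t'
match 'thr[^a-f]*t'

# The | alternation needs extended regular expressions (grep -E or egrep).
match -E '(th|sh).*rt'
```

Each call prints the matching words on one line. For example, the first prints tee the tie, and the last prints shirt short thwart.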

The above regular expressions all match whole words (because of the -w option). If the -w option were not present, they might match parts of words, resulting in a far greater number of matches. Also note that although the * means to match any number of the previous character, that number can be zero; for example, t[a-i]*e could actually match the letter sequence te, that is, a t and an e with zero characters between them.
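Both points can be seen on a three-word list. This is only a sketch with invented input: steel is included because it contains the letters tee in the middle, and te because it has nothing between the t and the e.

```shell
# Made-up input: "steel" contains "tee" as part of a longer word.
list=$(printf '%s\n' te tee steel)

with_w=$(printf '%s\n' "$list" | grep -w 't[a-i]e' | paste -s -d ' ' -)
without_w=$(printf '%s\n' "$list" | grep 't[a-i]e' | paste -s -d ' ' -)
zero_ok=$(printf '%s\n' "$list" | grep -w 't[a-i]*e' | paste -s -d ' ' -)

echo "with -w:    $with_w"
echo "without -w: $without_w"
echo "with *:     $zero_ok"
```

With -w only tee is reported. Without it, steel is reported as well. Adding the * lets te match too.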

Usually, you will use regular expressions to search for whole lines that match, and sometimes you would like to match a line that begins or ends with a certain string. The ^ character specifies the beginning of a line, and the $ character the end of the line. For example, ^The matches all lines that start with a The, and hack$ matches all lines that end with hack, and '^ *The.*hack *$' matches all lines that begin with The and end with hack, even if there is whitespace at the beginning or end of the line.
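The three patterns can be compared on a few sample lines. The lines below are invented for this sketch. The second one has spaces at both ends, and grep -c is used to print how many lines each pattern matches.

```shell
# Four made-up lines; the second has leading and trailing spaces.
lines=$(printf '%s\n' 'The quick hack' '  The old hack  ' 'A hack by The' 'The hacker')

starts=$(printf '%s\n' "$lines" | grep -c '^The')
ends=$(printf '%s\n' "$lines" | grep -c 'hack$')
both=$(printf '%s\n' "$lines" | grep -c '^ *The.*hack *$')

echo "^The: $starts  hack\$: $ends  both: $both"
```

Here ^The matches the first and last lines, hack$ matches only the first, and the combined pattern matches the first two because it tolerates the surrounding spaces.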

Because regular expressions use certain characters in a special way (these include . \ [ ] * ^ $ and, in extended regular expressions, + ?), these characters cannot be used as they stand to match themselves. This restriction would be a serious obstacle when trying to match, say, file names, which often contain the . character. To match a . you can use the sequence \. which forces interpretation as an actual . and not as a wildcard. Hence, the regular expression myfile.txt might match the letter sequence myfileqtxt or myfile.txt, but the regular expression myfile\.txt will match only myfile.txt.

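You can specify most special characters by adding a \ character before them: for example, use \[ for an actual [, \$ for an actual $, and \\ for an actual \. With extended regular expressions (egrep), \+ matches an actual + and \? matches an actual ?. (? and + are explained below.) Be aware that in the basic regular expressions of GNU grep the roles are reversed: a plain + or ? is an ordinary character, whereas \+ and \? act as the special operators.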
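A quick way to convince yourself of the difference is to search two made-up file names with and without the backslash. The sketch also escapes a [ in the same way. All of the sample strings are assumptions for illustration.

```shell
names=$(printf '%s\n' 'myfile.txt' 'myfileqtxt')

# Unescaped, the . matches any character, so both names are counted.
unescaped=$(printf '%s\n' "$names" | grep -c 'myfile.txt')

# Escaped, only the name with a real dot is printed.
escaped=$(printf '%s\n' "$names" | grep 'myfile\.txt')

# \[ matches an actual opening bracket.
bracket=$(printf '%s\n' 'x[0]' 'x0' | grep 'x\[0')

echo "$unescaped $escaped $bracket"
```

The unescaped pattern matches both names, the escaped one only myfile.txt, and the last search finds only x[0].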

5.2 The fgrep Command

fgrep is an alternative to grep. The difference is that while grep (the more commonly used command) matches regular expressions, fgrep matches literal strings. In other words, you can use fgrep when you want to search for an ordinary string that contains special characters, without having to precede each of them with \.
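The sketch below repeats the file-name search as a literal-string search. It uses grep -F, which POSIX defines as the equivalent of fgrep. Newer releases of GNU grep print a warning when invoked as fgrep, so the -F spelling is used here. The sample names are made up.

```shell
names=$(printf '%s\n' 'myfile.txt' 'myfileqtxt')

# With -F the dot is an ordinary character, so no backslash is needed.
literal=$(printf '%s\n' "$names" | grep -F 'myfile.txt')

echo "$literal"
```

Only myfile.txt is printed, even though the dot was not escaped.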

5.3 Regular Expression \{ \} Notation

x* matches zero to infinite instances of a character x. You can specify other ranges of numbers of characters to be matched with, for example, x\{3,5\}, which will match at least three but not more than five x's, that is xxx, xxxx, or xxxxx.

x\{4\} can then be used to match 4 x's exactly: no more and no less. x\{7,\} will match seven or more x's--the upper limit is omitted to mean that there is no maximum number of x's.

As in all the examples above, the x can be a range of characters (like [a-k]) just as well as a single character.

grep -w 'th[a-t]\{2,3\}t'
Matches the words theft, thirst, threat, thrift, and throat.
grep -w 'th[a-t]\{4,5\}t'
Matches the words theorist, thicket, and thinnest.
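To see the counting on its own, the following sketch applies the three forms of the notation to runs of two to six x's. The ^ and $ anchors are added here (they are not part of the examples above) so that each run must match in full. Without them, a run of six x's would also be reported by x\{3,5\}, because it contains a shorter run.

```shell
# Runs of 2, 3, 4, 5, and 6 x's, one per line.
runs=$(printf '%s\n' xx xxx xxxx xxxxx xxxxxx)

three_to_five=$(printf '%s\n' "$runs" | grep '^x\{3,5\}$' | paste -s -d ' ' -)
exactly_four=$(printf '%s\n' "$runs" | grep '^x\{4\}$')
at_least_four=$(printf '%s\n' "$runs" | grep '^x\{4,\}$' | paste -s -d ' ' -)

echo "3 to 5:    $three_to_five"
echo "exactly 4: $exactly_four"
echo "4 or more: $at_least_four"
```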

5.4 Extended Regular Expression + ? \< \> ( ) | Notation with egrep

An enhanced version of regular expressions allows for a few more useful features. Where these conflict with existing notation, they are only available through the egrep command.

+
is analogous to \{1,\}. It does the same as * but matches one or more characters instead of zero or more characters.
?
is analogous to \{0,1\}. It matches zero or one occurrence of the previous character.
\< \>
can surround a string to match only whole words.
(  )
can surround several strings, separated by |. This notation will match any of these strings. ( egrep only.)
\(  \)
can surround several strings, separated by \|. This notation will match any of these strings. ( grep only.)

The following examples should make the \< \> and ( ) notations clearer.

grep 'trot'
Matches the words electrotherapist, betroth, and so on, but
grep '\<trot\>'
matches only the word trot.
egrep -w '(this|that|c[aeiou]*t)'
Matches the words this, that, cot, coat, cat, and cut.
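The + and ? operators and the \< \> word markers can be tried in the same way. This is a rough sketch with made-up input. It uses grep -E, the POSIX spelling of egrep, and assumes a grep that supports \< and \> (GNU grep does).

```shell
words=$(printf '%s\n' ct cat coat)

# + requires at least one vowel between the c and the t.
one_or_more=$(printf '%s\n' "$words" | grep -E -w 'c[aeiou]+t' | paste -s -d ' ' -)

# ? allows at most one vowel.
zero_or_one=$(printf '%s\n' "$words" | grep -E -w 'c[aeiou]?t' | paste -s -d ' ' -)

# \< and \> tie the match to the start and end of a word.
whole_word=$(printf '%s\n' trot betroth 'fox trot' | grep '\<trot\>' | paste -s -d ',' -)

echo "+ : $one_or_more"
echo "? : $zero_or_one"
echo "\\< \\> : $whole_word"
```

The + pattern reports cat and coat, the ? pattern reports ct and cat, and the word markers accept trot and fox trot but reject betroth.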

5.5 Regular Expression Subexpressions

Subexpressions are covered in Chapter 8.

