Copyright © 2002 Paul Sheer.
A regular expression is a sequence of characters that forms a template used to search for strings (words, phrases, or just about any sequence of characters) within text. In other words, it is a search pattern. To get an idea of when you would need to do this, consider the example of having a list of names and telephone numbers. If you want to find a telephone number that contains a 3 in the second place and ends with an 8, regular expressions provide a way of doing that kind of search. Or consider the case where you would like to send an email to fifty people, replacing the word after the "Dear" with their own name to make the letter more personal. Regular expressions allow for this type of searching and replacing.
Many utilities use the regular expression to give them greater power when manipulating text. The grep command is an example. Previously you used the grep command to locate only simple letter sequences in text. Now we will use it to search for regular expressions.
In the previous chapter you learned that the ? character can be used to signify that any character can take its place. This is said to be a wildcard and works with file names. With regular expressions, the wildcard to use is the . character. So, you can use the command grep .3....8 <filename> to find the seven-character telephone number that you are looking for in the above example.
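As a minimal sketch of the . wildcard, the commands below build a small phone list and search it. The file contents and the names in it are made up for illustration; only the pattern comes from the text above.

```shell
# Hypothetical sample data: names followed by seven-character telephone numbers.
phonebook=$(mktemp)
cat > "$phonebook" <<'EOF'
Alice 4310098
Bob 5550123
Carol 7391248
EOF

# Each . stands for exactly one character of any kind:
# one character, a 3, four more characters, then an 8.
matches=$(grep '.3....8' "$phonebook")
echo "$matches"

rm -f "$phonebook"
```

This prints the lines for Alice and Carol. The pattern is not tied to the start or end of the number, so it would also match a longer run of characters that happens to contain such a sequence.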
Regular expressions are used for line-by-line searches. For instance, if the seven characters were spread over two lines (i.e., they had a line break in the middle), then grep wouldn't find them. In general, a program that uses regular expressions will consider searches one line at a time.
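The line-by-line behavior can be checked with a small experiment. In this sketch (the sample text is invented), the same telephone number appears once split across two lines and once on a single line.

```shell
# Hypothetical sample: the number 4310098 split over two lines, then intact.
sample=$(mktemp)
printf 'call 4310\n098 today\n' > "$sample"
printf 'call 4310098 today\n' >> "$sample"

# grep tests each line separately, so only the intact copy can match.
matches=$(grep '.3....8' "$sample")
echo "$matches"

rm -f "$sample"
```

Only the third line is printed, because neither half of the split number matches the pattern.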
Here are some regular expression examples that will teach you the regular expression basics. We use the grep command to show the use of regular expressions (remember that the -w option matches whole words only). Here the expression itself is enclosed in ' quotes for reasons that are explained later.
The above regular expressions all match whole words (because of the -w option). If the -w option were not present, they might match parts of words, resulting in a far greater number of matches. Also note that although the * means to match any number of the preceding character, that number can be zero; for example, t[a-i]*e could actually match the letter sequence te, that is, a t and an e with zero characters between them.
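The effect of -w and of a * that matches nothing can be seen in a short sketch like the following. The word list is invented for the purpose; the pattern t[a-i]*e is the one discussed above.

```shell
export LC_ALL=C   # keeps bracket ranges such as [a-i] predictable across locales

# Hypothetical word list, one word per line.
words=$(mktemp)
printf '%s\n' te the tide tone kite > "$words"

# With -w the match has to be a whole word: te (zero characters between
# t and e), the, and tide all qualify.
whole=$(grep -w 't[a-i]*e' "$words")

# Without -w a match anywhere inside a word is enough, so kite is also
# reported because of the te at its end.
partial=$(grep 't[a-i]*e' "$words")

echo "$whole"
echo "---"
echo "$partial"

rm -f "$words"
```

The word tone is never matched, since o lies outside the range [a-i].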
Usually, you will use regular expressions to search for whole lines that match, and sometimes you would like to match a line that begins or ends with a certain string. The ^ character specifies the beginning of a line, and the $ character the end of the line. For example, ^The matches all lines that start with The, hack$ matches all lines that end with hack, and '^ *The.*hack *$' matches all lines that begin with The and end with hack, even if there are spaces at the beginning or end of the line.
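Here is a small sketch of the three anchored patterns just described, run against four invented lines. The -c option makes grep print the number of matching lines instead of the lines themselves.

```shell
# Hypothetical test lines; the second has leading and trailing spaces.
lines=$(mktemp)
printf '%s\n' \
  'The quick hack' \
  '  The long way to hack  ' \
  'A hack by The author' \
  'The hackers' > "$lines"

starts=$(grep -c '^The' "$lines")             # lines beginning with The
ends=$(grep -c 'hack$' "$lines")              # lines ending with hack
padded=$(grep -c '^ *The.*hack *$' "$lines")  # same, tolerating spaces at either end

echo "$starts $ends $padded"

rm -f "$lines"
```

The counts come out as 2, 1, and 2: the padded line is missed by the first two patterns but caught by the third, and The hackers fails the last two because hack is followed by ers.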
Because regular expressions use certain characters in a special way (these are . \ [ ] * ^ $, plus + and ? in the extended syntax), these characters cannot be used directly to match themselves. This restriction gets in the way when you try to match, say, file names, which often use the . character. To match a . you can use the sequence \. which forces interpretation as an actual . and not as a wildcard. Hence, the regular expression myfile.txt might match the letter sequence myfileqtxt or myfile.txt, but the regular expression myfile\.txt will match only myfile.txt.
You can specify most special characters by adding a \ character before them; for example, use \[ for an actual [, \$ for an actual $, and \\ for an actual \. With egrep, \+ matches an actual + and \? matches an actual ?. Plain grep already treats an unescaped + or ? as an ordinary character, and GNU grep gives the backslashed forms \+ and \? their repetition meanings instead. (? and + are explained below.)
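The myfile.txt example can be tried directly. This sketch uses a scratch file holding the two letter sequences mentioned above and counts how many lines each pattern matches.

```shell
# Scratch file with the two letter sequences from the text.
names=$(mktemp)
printf '%s\n' 'myfile.txt' 'myfileqtxt' > "$names"

loose=$(grep -c 'myfile.txt' "$names")    # . is a wildcard: both lines match
strict=$(grep -c 'myfile\.txt' "$names")  # \. is a literal dot: one line matches

echo "$loose $strict"

rm -f "$names"
```

The single quotes keep the shell from touching the backslash before grep sees it.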
fgrep is an alternative to grep. The difference is that while grep (the more commonly used command) matches regular expressions, fgrep matches literal strings. In other words, you can use fgrep when you would like to search for an ordinary string that is not a regular expression, instead of preceding special characters with \.
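The sketch below compares a literal search with the same text used as a regular expression. It calls grep -F, which is the option form of fgrep; recent GNU grep releases print a warning that the fgrep name is obsolescent. The file contents are invented.

```shell
# Hypothetical notes file containing characters that are special in a regex.
notes=$(mktemp)
printf '%s\n' 'price[1] = 5.00' 'price1 = 5x00' > "$notes"

# Literal search: every character stands for itself.
literal=$(grep -F 'price[1] = 5.00' "$notes")

# Equivalent regular expression, with each special character escaped.
escaped=$(grep 'price\[1\] = 5\.00' "$notes")

# Unescaped, [1] is a one-character class and . is a wildcard,
# so the pattern matches the other line instead.
unescaped=$(grep 'price[1] = 5.00' "$notes")

echo "$literal"
echo "$escaped"
echo "$unescaped"

rm -f "$notes"
```

The first two searches both print price[1] = 5.00, while the unescaped pattern prints price1 = 5x00.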
x* matches zero or more instances of a character x. You can specify other ranges of numbers of characters to be matched with, for example, x\{3,5\}, which will match at least three but not more than five x's, that is xxx, xxxx, or xxxxx.
x\{4\} can then be used to match 4 x's exactly: no more and no less. x\{7,\} will match seven or more x's; the upper limit is omitted to mean that there is no maximum number of x's.
As in all the examples above, the x can be a range of characters (like [a-k]) just as well as a single character.
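The \{ \} notation can be exercised with a sketch like this one. The test lines are invented, and the patterns are wrapped in ^ and $ so that the counts apply to the whole line; without the anchors, x\{3,5\} would also match inside a longer run of x's.

```shell
export LC_ALL=C   # keeps bracket ranges such as [a-k] predictable across locales

# Hypothetical lines: runs of 2 to 7 x's, plus one three-letter word.
runs=$(mktemp)
printf '%s\n' xx xxx xxxx xxxxx xxxxxx xxxxxxx fig > "$runs"

three_to_five=$(grep -c '^x\{3,5\}$' "$runs")  # counts xxx, xxxx, xxxxx
exactly_four=$(grep '^x\{4\}$' "$runs")        # only xxxx
seven_plus=$(grep '^x\{7,\}$' "$runs")         # only xxxxxxx
class_run=$(grep '^[a-k]\{3\}$' "$runs")       # three letters from a to k: fig

echo "$three_to_five $exactly_four $seven_plus $class_run"

rm -f "$runs"
```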
An enhanced version of regular expressions allows for a few more useful features. Where these conflict with existing notation, they are only available through the egrep command.
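As a minimal sketch of the enhanced syntax, the commands below try the + and ? characters mentioned earlier, assuming their usual extended meanings: + for one or more of the preceding character, and ? for zero or one. The example uses grep -E, the option form of egrep, and an invented word list.

```shell
# Hypothetical word list: c and t with zero, one, or two a's between them.
cats=$(mktemp)
printf '%s\n' ct cat caat > "$cats"

one_or_more=$(grep -E '^ca+t$' "$cats")  # at least one a: cat, caat
zero_or_one=$(grep -E '^ca?t$' "$cats")  # at most one a: ct, cat

echo "$one_or_more"
echo "---"
echo "$zero_or_one"

rm -f "$cats"
```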
The following examples should make the last two notations clearer.
Subexpressions are covered in Chapter 8.