Sunday, May 27, 2007

Tutorial: grep and regular expressions

This short article gives you just enough to get working with regular expressions. I also hope that it may serve as a gentle introduction to the regular expressions family of man pages for those curious to learn more.

When parsing the messages files or the output from strace (Linux) or truss (Solaris), it is often useful to use slightly more complex REs than when finding an entry in the hosts file. Even when a script merely checks whether an entry exists in, for example, the hosts file, you may want to avoid false matches.

The grep command is a handy, simple command which we can use to test REs. The same REs typically also work with commands like sed and awk and, to a lesser extent, in the shell.

To test an RE (expression) we can simply use:

grep "expression" filename

Back to basics: The simplest RE is a character that stands for itself, and by joining together a bunch of these we get a series of characters that matches a specific string. The command "grep xyz /etc/hosts" will find (and print) all lines that contain the sequence "xyz" anywhere.
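
As a small sketch of the literal match described above, the following uses a made-up stand-in for /etc/hosts (the addresses and host names are hypothetical) so that it can be run without touching the real file:

```shell
# Hypothetical sample data standing in for /etc/hosts.
hosts_file=$(mktemp)
printf '%s\n' \
  '127.0.0.1   localhost' \
  '192.168.1.10   xyzserver' \
  '192.168.1.11   mailhost' > "$hosts_file"

# Print every line containing the sequence "xyz" anywhere on the line.
literal_matches=$(grep "xyz" "$hosts_file")
echo "$literal_matches"

rm -f "$hosts_file"
```

Only the xyzserver line should be printed, because it is the only one containing "xyz".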

To match only entries starting or ending with a string, you can use one of these formats:

grep ^word filename
grep word$ filename

The characters ^ and $ have a special meaning in regular expressions, but only when used as, respectively, the first and last item in the RE. The ^word expression matches lines STARTING with word, while word$ matches lines which END with word.
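
A minimal sketch of the two anchors, run against a throw-away file with invented contents. The single quotes are my own addition: they keep the shell from treating the $ specially before grep sees it.

```shell
# Hypothetical sample file with "word" at the start, end and middle of a line.
sample=$(mktemp)
printf '%s\n' \
  'word at the start' \
  'ends with word' \
  'a word in the middle' > "$sample"

# ^ anchors the match to the beginning of the line.
starts_with=$(grep '^word' "$sample")
# $ anchors the match to the end of the line.
ends_with=$(grep 'word$' "$sample")

echo "starts: $starts_with"
echo "ends:   $ends_with"

rm -f "$sample"
```

The line with "word" in the middle is matched by neither command.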

An easy way to find lines which contain both of two strings is by means of two grep commands, like this:

grep "abcmno" filename | grep "mnoxyz"

Note, however, that this also matches lines that contain "abcmnoxyz", where the two strings overlap, and that it does not care about the order in which the two strings appear on the line. To require both strings, with one appearing before the other, you can use some wildcards. Example:

grep "abcmno.*mnoxyz" filename

Note the dot-star: the dot says "any character", and the * says "zero-or-more-of-them".
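
To make the difference between the two approaches concrete, here is a sketch with three invented lines: one with the strings in order, one with them reversed, and one where they overlap.

```shell
# Hypothetical test lines: in order, reversed, and overlapping.
sample=$(mktemp)
printf '%s\n' \
  'abcmno then mnoxyz' \
  'mnoxyz then abcmno' \
  'abcmnoxyz' > "$sample"

# Two greps in a pipeline: order does not matter and overlap is accepted.
piped=$(grep "abcmno" "$sample" | grep "mnoxyz")

# One RE with dot-star: "abcmno" must come first, followed later by a
# complete, separate "mnoxyz".
ordered=$(grep "abcmno.*mnoxyz" "$sample")

echo "pipeline matches:"
echo "$piped"
echo "dot-star matches:"
echo "$ordered"

rm -f "$sample"
```

The pipeline reports all three lines, while the dot-star expression reports only the first.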

The dot wildcard on its own (i.e. without the * as above) will match any one character.

grep abcmno.mnoxyz filename

This is similar to using a "?" in shell file name expansion.
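
A quick sketch of the single dot, using made-up lines with zero, one and two characters between the two strings:

```shell
# Hypothetical test lines with 0, 1 and 2 characters between the strings.
sample=$(mktemp)
printf '%s\n' \
  'abcmnomnoxyz' \
  'abcmno-mnoxyz' \
  'abcmno--mnoxyz' > "$sample"

# The dot must match exactly one character, no more and no fewer.
single_dot=$(grep "abcmno.mnoxyz" "$sample")
echo "$single_dot"

rm -f "$sample"
```

Only the line with exactly one character in between is printed.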

The next aspect of regular expressions is similar to the "dot" wildcard, but instead of matching any character, it matches a character from a list of possibilities. Example:

grep "abc[mno]xyz" filename

This matches any of the following:

  • abcmxyz
  • abcnxyz
  • abcoxyz

The [...] is used to indicate a list of possibilities. The list can be quite large, and can contain some simple sequences, like [a-z] (for a through z) or [k-p] (for k through p). Another common possibility is [0-9] for any "numeric" character.
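
The list and range forms can be tried out as follows. The test lines are invented, and setting LC_ALL=C for the range is my own precaution, since the meaning of a range such as [k-p] can depend on the locale.

```shell
# Hypothetical test lines.
sample=$(mktemp)
printf '%s\n' abcmxyz abcnxyz abcoxyz abcpxyz abcmnxyz > "$sample"

# A list: exactly one of m, n or o between "abc" and "xyz".
list_matches=$(grep "abc[mno]xyz" "$sample")
echo "$list_matches"

# A range: any single character from k through p.
# -c prints the number of matching lines instead of the lines themselves.
range_count=$(LC_ALL=C grep -c "abc[k-p]xyz" "$sample")
echo "lines matching the range: $range_count"

rm -f "$sample"
```

The line abcmnxyz is matched by neither command, because the brackets stand for exactly one character.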

A typical use of such a list is to make one character in a search case-insensitive, for example to find both "foo" and "Foo":

grep "[Ff]oo" filename
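
A short sketch of this, with invented lines, shows that only the first letter becomes case-insensitive:

```shell
# Hypothetical test lines.
sample=$(mktemp)
printf '%s\n' 'foo' 'Foo bar' 'FOO' 'food' > "$sample"

# Matches "foo" or "Foo" anywhere on the line; "FOO" has the wrong
# case in the second and third letters, so it is not matched.
case_matches=$(grep "[Ff]oo" "$sample")
echo "$case_matches"

rm -f "$sample"
```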

We can combine these building blocks to search for lines which contain a single number and nothing else:

grep "^[0-9][0-9]*$" filename

  • The first [0-9] will match a numeric in the first position on the line. Note the preceding ^.
  • The second [0-9]* will match zero-or-more additional numerics, that is, besides the numeral matched by the first [0-9] occurrence.
  • The trailing $ means that after the last numeric, there must be an end-of-line.
  • Consequently, there cannot be anything other than a numeric character anywhere on the line.
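
The points above can be checked with a sketch like the following. The test lines are invented and include an empty line, a line with a leading space and a decimal number.

```shell
# Hypothetical test lines, including an empty line.
sample=$(mktemp)
printf '%s\n' '42' '007' '12a' '' ' 5' '3.14' > "$sample"

# One numeric right after the start of the line, then zero or more
# numerics, then the end of the line.
numeric_lines=$(grep '^[0-9][0-9]*$' "$sample")
echo "$numeric_lines"

rm -f "$sample"
```

Only 42 and 007 are printed. The empty line fails because the first [0-9] needs at least one numeric to match.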

Another use of the [...] is to exclude the characters in the list. For example, to match any character which is NOT a numeric, here at the end of a line, you can use:

grep "[^0-9]$" filename

This example will find lines whose last character is NOT a numeric. The ^ as the first character inside the [...] set indicates that the "complement" of the [...] sequence is to be used.
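
One subtlety is worth a sketch: [^0-9] still has to match a character, so an empty line is not reported. For contrast, the example also uses grep's -v option (print lines that do NOT match), which is not covered in this article. The test lines are invented.

```shell
# Hypothetical test lines, including an empty line.
sample=$(mktemp)
printf '%s\n' 'room 101' '' 'hello' 'v2b' > "$sample"

# Lines whose last character exists and is not a numeric.
non_digit_end=$(grep '[^0-9]$' "$sample")
echo "$non_digit_end"

# Contrast: count lines that do NOT end in a numeric; this includes
# the empty line, which has no last character at all.
inverted_count=$(grep -c -v '[0-9]$' "$sample")
echo "lines not ending in a numeric: $inverted_count"

rm -f "$sample"
```

The first command prints hello and v2b, while the second counts three lines because it also includes the empty one.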

It is worth noting that in shell file-name expansion, an exclamation mark "!" can be used as the first character inside the [...] to invert the list. For example, listing all files starting with a single dot, but excluding the double-dot (parent directory), can be done like this:

ls .[!.]*

The above can be read as: list files with at least two characters in the name, where the first must be a dot and the second must be anything other than a dot. Note: Some shells support use of either ^ or ! equally in this regard.
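
Here is a sketch of that pattern in a scratch directory, so that the result does not depend on your own home directory. The file names are invented, and the -d option is my own addition: it makes ls print the names themselves even if one of them happens to be a directory.

```shell
# Hypothetical scratch directory with two dot-files and one normal file.
demo_dir=$(mktemp -d)
touch "$demo_dir/.profile" "$demo_dir/.bashrc" "$demo_dir/visible.txt"

# A dot, then one character that is not a dot, then anything.
# "." is too short to match and ".." fails on the second character.
dotfiles=$(cd "$demo_dir" && ls -d .[!.]*)
echo "$dotfiles"

rm -rf "$demo_dir"
```

This should list .bashrc and .profile only.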

2 comments:

Anonymous said...

Liked your explanation:

use some wild-cards. Example:

grep abcmno.*mnoxyz filename


Note the dot-star ... The dot says "any character", and the * says "zero-or-more-of-them"

Was struggling to find an explanation elsewhere

Thanks

Hartz said...

Thank you for your comment - knowing that my article was helpful really goes a long way encouraging me to write more!

Cheers
_hartz