Sunday, May 27, 2007

Tutorial: awk sees text files as rows and columns

Building on the regular expressions tutorial I posted yesterday, this introduction to awk will, if you have never used "awk" before, revolutionize your shell scripting experience. As I hinted in the title, the awk command "sees" its input as a collection of rows and columns. So the simplest awk command would be one which prints column 1 of a text file. Before I show an example I must mention two things about awk.
  • Awk uses the $ sign extensively in its internal language, so it is common practice to put single quotes ('....') around awk statements to protect them from the shell.
  • Awk's default "field separator" is any run of blanks, i.e. spaces and tabs.
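Both points are easy to check on a hand-made line. The following is only a sketch on made-up input; the -F option (a standard awk option, not otherwise covered in this post) is included just to show that the separator can be changed.

```shell
# With the default separator, a run of spaces and tabs counts as ONE separator.
second=$(printf 'alpha   beta\t\tgamma\n' | awk '{print $2}')
echo "$second"        # beta

# -F selects a different separator, here a colon (the sample line is made up).
login_shell=$(printf 'root:x:0:0:root:/root:/bin/bash\n' | awk -F: '{print $7}')
echo "$login_shell"   # /bin/bash
```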
Keeping this in mind, if we look at the output from a command such as ls -l:
root@linwarg:/etc# ls -l
total 1996
drwxr-xr-x  8 root   root      4096 2007-04-16 10:03 acpi
-rw-r--r--  1 root   root      2077 2006-09-09 21:08 adduser.conf
-rw-r--r--  1 root   root        46 2007-05-26 15:24 adjtime
-rw-r--r--  1 root   root        50 2006-09-09 21:24 aliases
drwxr-xr-x  2 root   root      8192 2007-05-26 16:01 alternatives
-rw-r--r--  1 root   root       395 2007-03-05 08:38 anacrontab
drwxr-xr-x  7 root   root      4096 2007-04-16 10:05 apm
...
We see that the file size is always listed in the 5th column. So let's print just the sizes:
ls -l | awk '{print ($5)}'
This highlights some more aspects of the awk scripting language.
  • The awk language surrounds statements in curly-braces.
  • The "$5" is reminiscent of the shell's positional parameter for the 5th command-line argument, but here, being between a pair of single quotes, it is interpreted by awk, not by the shell. As is immediately evident, the above command prints only the 5th column of every row of input.
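The quoting point can be verified on a made-up input line. In this sketch the function names with_single and with_double are hypothetical; the second function is called with five arguments so that the shell has its own $5 to substitute.

```shell
# Single quotes: awk receives the text {print $5} and prints field 5.
with_single() {
    printf 'a b c d e\n' | awk '{print $5}'
}

# Double quotes: the shell replaces $5 with the function's 5th argument
# BEFORE awk ever sees the program text.
with_double() {
    printf 'a b c d e\n' | awk "{print $5}"
}

with_single                  # e
with_double p q r s '$1'     # awk is handed {print $1}, so this prints: a
```

In other words, with double quotes awk runs whatever program text the shell leaves behind, which is rarely what was intended.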
The awk language has got two special "address" prefixes that can be given to commands (the awk documentation calls them "patterns"), being the keywords "BEGIN" and "END". Prefixing one of these to a statement will cause the statement to be executed only once: for BEGIN, immediately before reading and parsing the first line of input, or for END, once after reading and parsing the last line of input. Now consider this more complex example:
ls -l | awk '{TOTAL=TOTAL+$5} END {print (TOTAL)}'
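For a repeatable check, the same script can be run against the sample listing shown earlier instead of a live directory. This is only a sketch: the variable name listing is made up, and the data are the seven entries printed above.

```shell
# The sample listing from earlier, stored in a variable so the result is repeatable.
listing='total 1996
drwxr-xr-x  8 root   root      4096 2007-04-16 10:03 acpi
-rw-r--r--  1 root   root      2077 2006-09-09 21:08 adduser.conf
-rw-r--r--  1 root   root        46 2007-05-26 15:24 adjtime
-rw-r--r--  1 root   root        50 2006-09-09 21:24 aliases
drwxr-xr-x  2 root   root      8192 2007-05-26 16:01 alternatives
-rw-r--r--  1 root   root       395 2007-03-05 08:38 anacrontab
drwxr-xr-x  7 root   root      4096 2007-04-16 10:05 apm'

# Same awk script as above, fed from the variable instead of ls -l.
total=$(printf '%s\n' "$listing" | awk '{TOTAL=TOTAL+$5} END {print (TOTAL)}')
echo "$total"   # 18952
```

The "total 1996" header line has only two fields, so its $5 is empty and adds nothing to the sum.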
The awk script consists of two separate statements. The second statement is prefixed by the END keyword. For each line of input encountered, the script adds the value of the 5th field (implicitly converted to a numeric value) to the variable TOTAL. After all lines of input have been read, it prints the value of the TOTAL variable (implicitly converted to a text string). Note that awk initializes variables to zero/null values on first reference. The above example prints only a single line, showing the value of TOTAL obtained. In fact, all awk commands take addresses: the above examples both use a default (blank) address that matches all input lines. Addresses can be specified as simple regular expressions, or as complex conditions. A typical use would be to add up the 5th column only for files with some commonality in the name, e.g. to add up the sizes of all the files with names like "*txt", you can use:
ls -l | awk '/txt$/ {TOTAL=TOTAL+$5} END {print ("Total for txt files:", TOTAL)}'
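The command above can be tried on a made-up listing (the file names and sizes below are invented for illustration). The sketch also shows the two other kinds of address: a condition, here a numeric comparison on the 5th field, and a regular expression tested against a single field with the ~ operator. The field number 8 assumes the ls -l layout shown earlier.

```shell
# Made-up listing in the same layout as the ls -l output shown earlier.
files='-rw-r--r-- 1 root root  100 2007-05-26 15:24 notes.txt
-rw-r--r-- 1 root root 2500 2007-05-26 15:25 archive.tar
-rw-r--r-- 1 root root  300 2007-05-26 15:26 todo.txt
-rw-r--r-- 1 root root   40 2007-05-26 15:27 txt2pdf.conf'

# Regular expression address: only lines ending in "txt".
txt_total=$(printf '%s\n' "$files" | awk '/txt$/ {TOTAL=TOTAL+$5} END {print (TOTAL)}')

# Condition address: only lines whose 5th field is larger than 1000.
big_total=$(printf '%s\n' "$files" | awk '$5 > 1000 {TOTAL=TOTAL+$5} END {print (TOTAL)}')

# Regular expression applied to one field: names that START with "txt".
prefix_total=$(printf '%s\n' "$files" | awk '$8 ~ /^txt/ {TOTAL=TOTAL+$5} END {print (TOTAL)}')

echo "$txt_total $big_total $prefix_total"   # 400 2500 40
```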
The above line only executes the TOTAL=TOTAL+$5 statement when the expression /txt$/ is matched by the current input line. The expression is the "address", indicating when to execute the statement. Statements can be complex, for example to print the list of files being added up, with the total at the end, we can add an extra step into the command like this:
ls -l | awk '/txt$/ {print;TOTAL+=$5} END {print ("Total for txt files:", TOTAL)}'
Note - I used the short-form of the add instruction to keep the command from becoming too long. The print command, used with no arguments as in the above example, simply emits the complete input line. awk has also got a printf statement which can be used with great results, for example to simply re-format the output of the ls -l command (the field numbers assume the ls -l layout shown earlier):
ls -l | awk '{printf ("%-30s %20d %s %3d %8s %8s %s %s\n", $8, $5, $1, $2, $3, $4, $6, $7)}'
There are (probably many) better ways to do this, but this example shows some of the mathematical capability of awk:
echo 5 3 | awk '{printf ("%5.3f\n", $1/$2)}'
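A few more arithmetic cases, as a sketch on made-up numbers. It relies on standard awk behaviour: fields that look like numbers are converted automatically, %d drops the fractional part, and %% prints a literal percent sign.

```shell
# The same quotient shown truncated and with 3 decimals, plus the inverse ratio as a percentage.
formats=$(echo 5 3 | awk '{printf ("%d %5.3f %.1f%%\n", $1/$2, $1/$2, 100*$2/$1)}')
echo "$formats"   # 1 1.667 60.0%

# Dividing by zero is a fatal error in awk, so test the divisor first.
guarded=$(echo 5 0 | awk '{ if ($2 != 0) print $1/$2; else print "n/a" }')
echo "$guarded"   # n/a
```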
awk has got a few built-in variables which are updated automatically. The NF variable, for example, holds the number of fields on the current line. The NR variable contains the current input line (or record) number. Interestingly, the NF variable can be used in conjunction with a $, e.g. "$NF", to refer to the last field (or word) on a line, even when every line has got a different number of fields, e.g.:
ls -l | awk '{print ($NF)}'
This can be further advanced, for example to get the second-to-last field, one can use something like:
ls -l | awk '{print ($(NF-1))}'
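The built-in variables described above can be tried on three made-up lines of different lengths. The second command in this sketch contrasts $(NF-1) with $NF-1, which awk reads as "the last field, minus one".

```shell
# NR is the line number, NF the field count, $NF the last field on the line.
report=$(printf 'one\ntwo words\nthree short words\n' | awk '{print NR, NF, $NF}')
echo "$report"
# 1 1 one
# 2 2 words
# 3 3 words

# $(NF-1) is the second-to-last field; $NF-1 subtracts 1 from the last field.
pair=$(echo 10 20 30 | awk '{print $(NF-1), $NF-1}')
echo "$pair"   # 20 29
```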
Note the extra set of parentheses in the above example. Longer awk scripts are often worth placing in their own separate files, which allows the awk statements to be formatted with indentation, etc. This is especially useful when there are many complex statements, and it eliminates the problems sometimes experienced with the shell expanding/interpreting special characters in the awk script. I will do a follow-up article on awk in the future, as this only barely scratches the surface of its capabilities.
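As a rough sketch of the separate-file approach, the script below is written to a temporary file and run with awk's -f option. The file contents, the variable names and the sample data are all made up for illustration; the script combines BEGIN, a regular-expression address and END as described earlier.

```shell
# Write a small awk program to a temporary file (hypothetical example script).
# The quoted 'EOF' stops the shell from touching the $ signs inside it.
script=$(mktemp)
cat > "$script" <<'EOF'
# Runs once, before any input is read.
BEGIN {
    print "Files ending in txt:"
}

# Runs only for input lines that end in "txt".
/txt$/ {
    print "  " $NF
    TOTAL += $5
}

# Runs once, after the last input line.
END {
    print "Total for txt files:", TOTAL
}
EOF

# Feed three made-up ls -l style lines to the script.
result=$(printf '%s\n' \
    '-rw-r--r-- 1 root root 100 2007-05-26 15:24 notes.txt' \
    '-rw-r--r-- 1 root root 250 2007-05-26 15:25 image.png' \
    '-rw-r--r-- 1 root root 300 2007-05-26 15:26 todo.txt' |
    awk -f "$script")
echo "$result"
rm -f "$script"
```

Because the program lives in its own file, no shell quoting is needed around it, and comments and indentation cost nothing.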
