Saturday, June 9, 2007

Tutorial: Redirection and Pipes

Today I realized I should not gloss over two seemingly simple items: the first being pipes and redirection, the other being substitution. This article explains the first of these, showing those new to Linux and Unix how to link simple commands together to build powerful tools.

Before I start, a quick introduction to the Unix command-line syntax is appropriate.

  • The simplest Unix command is just a command word, which is the full path name of the file to be executed, starting with a slash representing the root. For example, if you enter /usr/bin/ls, the shell will run the ls program found in the /usr/bin directory.
  • Similarly, if you enter a command without specifying the directory, the shell will search all the directories specified in the PATH variable, in the order they are listed, and will execute the first program it finds with the specified name (a short example follows just after this list).
  • Slightly more complex commands take options and/or arguments. These are extra "words" which modify the behavior of the program. For example, the ps program lists processes associated with your terminal session, but if you add the "-e" option, it will list all processes running on the system.
  • Simple options which turn on a different behavior in the command are called switches, and these can be stacked. For example, the -f switch, which causes ps to include additional details about each listed process, and the -e switch mentioned above can be stacked, like this:
$ ps -ef
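
As an aside on the PATH lookup mentioned in the list above, you can inspect the search path and ask the shell which ls it would actually run. A minimal illustration follows; the directories and location shown here are made up, and your system's output will differ:

$ echo $PATH
/usr/local/bin:/usr/bin:/bin
$ type ls
ls is /bin/ls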
    

Stacked like this, ps -ef produces a listing of "every process" with "full" details for each.

A note on the formatting in this article: when a line is preceded by a "$", it is a command to be entered at the shell prompt, though the $ itself must not be typed. The usual shell prompt is a $ for normal users, or a # for root, though your system may well show additional details in the prompt string. Lines in blue that do not start with a "$" are the output generated by the command.

There is more to commands than this, but it is enough to understand the rest of this tutorial, so without further delay, on to the real subject: redirection and pipes.

Redirection allows you to "store" the output from a command directly into a file on disk, or to read lines from a file on disk.

$ ps -ef >/tmp/processlist.txt
    

If you run the command above you get ... nothing, and you get it fast. That's right: the ps -ef command on its own produces the expected output, but the ">/tmp/processlist.txt" portion causes this output to go into the file /tmp/processlist.txt instead of appearing in your terminal. You can examine the newly created file with the ls command (which "lists" information about files and/or directories):

$ ls -l /tmp/processlist.txt 
-rw-r--r-- 1 user1 user1 23100 Jun 4  10:06 /tmp/processlist.txt
    

And you can actually view its contents with any one of the "cat", "less" or "more" commands, like this:

$ cat /tmp/processlist.txt
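
If the file has more lines than fit in your terminal, less (or more) is a better viewer for it, since it pauses after each screen-full and lets you quit with "q" - for example:

$ less /tmp/processlist.txt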
    

Now that you have this file on the disk you can use it over and over. Enter the above cat command again, and it shows the file contents again. Much more interesting, though, is to filter out specific lines from the file. The head command prints the first 10 lines of the specified file.

$ head /tmp/processlist.txt 
$ tail -4 /tmp/processlist.txt
    

And you guessed it: tail prints the last 10 lines. Specifying a switch with head or tail, like the "-4" above, lets you control the number of lines displayed if you want something other than the default 10. Much more interesting than "head" or "tail" is grep, which I used extensively in a previous tutorial.
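
Before moving on to grep, one quick aside: head is also handy for grabbing just the heading line that ps prints at the top of the listing. On a typical Linux system the result will look something like this, though the exact columns may differ on yours:

$ head -1 /tmp/processlist.txt
UID        PID  PPID  C STIME TTY          TIME CMD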

The "grep" command filters lines out based on a special rule for searching, called a "regular expression". These can be quite complex, but does not always need to be, e.g:

$ grep root /tmp/processlist.txt
    

The above command reads the specified file and prints only the lines that contain the string "root". You can also get grep to do the inverse, like this:

$ grep -v root /tmp/processlist.txt
    

Note the "-v" switch which modified grep's behavior to print lines not matching the expression. In the above examples, "grep", "head", "tail" and "cat" were expressly told what file to open and search through. The /tmp/processlist.txt filename specified is "an argument" of each command. Combining grep searches with redirection allows us to create more stored files for further parsing.

$ grep root /tmp/processlist.txt >/tmp/root_processes.txt 
$ grep -v root /tmp/processlist.txt >/tmp/nonroot_processes.txt
     

This creates two files with complementary information - the first with all the lines containing the string root, in the file /tmp/root_processes.txt, and the other with all the lines that do not contain this string. In the redirection examples above, the "greater than" sign (>) was used like an arrow pointing from the command ... to the file. I mentioned above that redirection can also read from a file, in which case you just switch the direction of the "arrow". For example:

$ wc -l < /tmp/nonroot_processes.txt
    

will read the file /tmp/nonroot_processes.txt and print the number of lines found. (The wc command is the so-called "word count" command, and the -l switch modifies its behavior to count lines instead.) The "grep", "head", "tail", "cat", etc. commands could also have been used in this fashion. See for example:

$ cat </tmp/root_processes.txt
      

which is almost identical to

$ cat /tmp/root_processes.txt
      

In fact, the difference is of academic importance only, and has to do with how the file is opened and who owns the file handle. The output is identical. However, in the case of wc the output is not the same. GASP! How can wc report something different depending on how the file was opened, after I just said that how the file is opened is of academic importance only? The answer is that wc reports the same information in that the number of lines will be correct in both cases, e.g.:

$ wc -l </tmp/nonroot_processes.txt
$ wc -l /tmp/nonroot_processes.txt
      

But in the first case wc does not print the file name, while in the second it does. This is because the redirected input is read via a special file handle called "standard input", whereas when the program opens the file explicitly itself, it is automatically aware of the file name and its location on disk.

With that little behavioral artifact, I have arrived at the concept central to all redirection and pipes (which I will get to in a minute): Unix and Linux commands write their output to "standard output" and read their input from "standard input". When you perform redirection, you literally redirect this output to go somewhere else, or the reading to come from somewhere else.

If you run two commands in succession, both redirecting their output to the same file, the output from the first command will be lost: the redirection overwrites the file. In order to append more lines to an existing file, use the double-redirection arrows (>>), like this:

$ cat /tmp/nonroot_processes.txt >/tmp/parts
$ cat /tmp/root_processes.txt >>/tmp/parts
      

This is especially useful for logging events, i.e. when you do not want information about old events to be lost when new events are recorded.

Next up: piping. It is possible to take the output from one command and redirect it into another command running at the same time, without having to first store it in a file. This is achieved by means of the "|" pipe symbol. For example:

$ ps -ef | more
      

The "more" command is made for situations where you have more lines being displayed on the screen than what can fit in the terminal, causing the lines to scroll past before you can read them. When a screen-full is reached, "more" pauses the scrolling of the output, and allows you to hit ENTER for one more line, SPACE for another screen-full, or "q" to "quit" immediately. (Note; less is the successor to more, and allows you to scroll backwards, up or down by half a screen-full, and most importantly to me, when you use it to search for words, it highlights them on the screen) The above method of "pausing" terminal output is used every day by Unix and Linux administrators world wide, especially with large files. Another example:

$ egrep -i "warn|err" /var/log/messages | less
      

The egrep command is an extended grep. In particular, it makes it easy to specify multiple search strings. The above command will show lines with either the string "warn" or "err", and the -i switch makes the search case insensitive, therefore you will also get ERR, Warn, etc.
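
Purely as an illustration (the extra search word "fail" is my own addition, not something every system necessarily logs), the pattern can be extended with a further alternative in exactly the same way:

$ egrep -i "warn|err|fail" /var/log/messages | less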

The pipe to less is used in case there is a lot of output. Running the above command on your Linux machine every week or so will tell you a lot about its health, as basically every vaguely standard piece of software will log its messages to this file. It is possible to make longer pipe chains. For example:

$ ps -ef | grep root | less
      

Run the ps command with the -e and -f switches; pipe its output into grep to pick out the lines containing the string root; pipe this output into less to make it manageable on the screen. A pipe really just redirects the output from one command into the input of another command, without having to first store it on disk. You can usually only have one input and two output redirectors per command (there are some exceptions, but that is a somewhat advanced topic). That is right - two output redirectors. Earlier you saw that with redirection the command had no output, but this command still produces output:

$ ls -l /etc/file799.xyz > /tmp/fileinfo.txt
/etc/file799.xyz: No such file or directory
    

Basically, the output is a message warning you that the ls command encountered an error. This error message is not printed on the standard output - in Unix and Linux, error messages are printed on a special, dedicated file called "standard error". The below example uses all three redirections:

$ runreport </tmp/report_options >/tmp/main_report.txt 2>/tmp/report_errors.txt
      

Here the command "runreport" is some imaginary program. It will be reading the lines in the /tmp/report_options file, which supposedly controls how it reports. The report output will be stored in a file in /tmp/main_report.txt. If any errors are encountered, these will be recorded in the file /tmp/report_errors.txt You will see a 2>for redirecting standard output. The >redirection of standard output has in fact got an implied 1, for file handle number 1, which is "standard output". The three standard file descriptors are:

0 - Standard Input or STDIN
1 - Standard Output or STDOUT
2 - Standard Error or STDERR
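
To see this split in action with a real command instead of the imaginary runreport, you could repeat the earlier ls example and give each file descriptor its own file (the two file names below are chosen purely for illustration):

$ ls -l /etc/passwd /etc/file799.xyz >/tmp/ls_output.txt 2>/tmp/ls_errors.txt   # illustrative file names
$ cat /tmp/ls_errors.txt

The first file should now hold the normal listing for /etc/passwd, while the second holds only the "No such file or directory" complaint.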

Note that in the above examples, the file to which the standard output is redirected does not capture the error messages. It is possible to create a single file with both the normal output and any error messages interleaved, as it is often handy to see where and when a job or process produced error messages. To do this, there is a way to redirect the standard error into the standard output. (Read that again.) For example:

$ runreport </tmp/report_options >/tmp/main_report.txt 2>&1
      

The "&1" at the end of the above line means standard output, and this will receive "2> which is standard error. This is getting quite lengthy, but a final note on syntax. You can generally insert spaces in anywhere except inside a file name or command name, which has to be a single word. So the following commands are equivalent:

$ ls -l /tmp >/tmp/fileinfo 
$ ls -l /tmp>/tmp/fileinfo
      

In fact, you can put redirection anywhere on the command line, so the below is a 3rd valid and equivalent variation:

$ >/tmp/fileinfo ls -l /tmp
      

However, you cannot break up the "2>&1" operator, or the double-redirect ">>" used for appending output to existing files.
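
To tie the two ideas together, here is one last one-liner of my own devising (the log file name is just an example): it pipes the process listing through grep and wc, and appends the resulting count of lines not mentioning root to a running log file.

$ ps -ef | grep -v root | wc -l >>/tmp/process_counts.txt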

That is redirection and pipes in a nutshell. Next up is command line substitution.

1 comment:

Niki Ciurlea said...

Thank you very much. Nice presentation. I just needed a refresher and I found this tutorial useful.