Today I realized I should not gloss over two seemingly simple items: the
first being pipes and redirection, and the other being substitution. This article
explains the first of these, showing those new to Linux and Unix how to link
together simple commands to build powerful tools.
Before I start, a quick introduction to the Unix command-line syntax is
appropriate.
- The simplest Unix command is just a command word, which is the full path
name of the file to be executed, starting with a slash representing the root. For example,
if you enter /usr/bin/ls, the shell will run the ls program found in the /usr/bin
directory.
- Similarly if you enter a command without specifying the directory, the
shell will search all the directories specified in the PATH variable, in the order they are
listed, and will execute the first program it finds with the name specified.
- Slightly more complex commands take options and/or arguments. These are
extra "words" which modify the behavior of the program. For example, the ps program lists
processes associated with your terminal session, but if you add the "-e" option, it will
list all processes running on the system.
- Simple options which turn on a different behavior in the command are
called switches, and these can be stacked. For example the -f switch, which causes ps to
include additional details about each listed process and the -e switch mentioned above, can
be combined, like this:
$ ps -ef
to produce a listing of "every process" in "full" format.

A note on the formatting in this article: when a line is preceded by a "$", it is a
command to be entered at the shell prompt, though the $ itself must not be entered. The usual
shell prompt is a $ for normal users, or a # for root, though your system may well show
additional details in the prompt string. Lines in blue text that do not start with a
"$" are the output generated by the command.

There is more to commands, but this is enough to
be able to understand the rest of this tutorial, so without further delay, on to the real
subject: redirection and pipes.

Redirection allows you to "store" the output from a command
directly into a file on disk, or to read lines from a file on disk.
$ ps -ef >/tmp/processlist.txt
If you run the command above you get ... nothing, and you get it fast.
That's right. The ps -ef command on its own produces the expected output, but the
">/tmp/processlist.txt" portion causes this output to go into the file /tmp/processlist.txt
instead of appearing in your terminal. You can examine the newly created file with the ls
command (which "lists" information about files and/or directories):
$ ls -l /tmp/processlist.txt
-rw-r--r-- 1 user1 user1 23100 Jun 4 10:06 /tmp/processlist.txt
And you can actually view its contents with any one of the "cat", "less"
or "more" commands, like this:
$ cat /tmp/processlist.txt
Now that you have this file on the disk you can use it over and over. Enter
the above cat command again, and it shows the file contents again. Much more interesting,
though, is to pick out specific lines from the file. The head command prints the first 10
lines of the specified file:
$ head /tmp/processlist.txt
$ tail -4 /tmp/processlist.txt
And you guessed it, tail prints the last 10 lines. Specifying a switch with
head or tail, like the "-4" above, lets you control the number of lines being displayed if
you want something other than the default 10. Much more interesting than "head" or "tail" is
grep, which I used extensively in a previous tutorial.
The "grep" command filters lines out based on a special rule for
searching, called a "regular expression". These can be quite complex, but does not always
need to be, e.g:
$ grep root /tmp/processlist.txt
The above command reads the specified file and prints only the lines that
contain the string "root". You can also get grep to do the inverse, like this:
$ grep -v root /tmp/processlist.txt
Note the "-v" switch which modified grep's behavior to print lines
not matching the expression. In the above examples, "grep",
"head", "tail" and "cat" were expressly told what file to open and search through. The
/tmp/processlist.txt filename specified is "an argument" of each command. Combining grep
searches with redirection allows us to create more stored files for further
parsing.
$ grep root /tmp/processlist.txt >/tmp/root_processes.txt
$ grep -v root /tmp/processlist.txt >/tmp/nonroot_processes.txt
This then creates two files with complementary information: the first,
/tmp/root_processes.txt, holds all the lines containing the string root, and the other
file holds all the lines that do not contain this string. In the redirection examples
above, the "greater than" sign (>) was used like an arrow pointing from the command ... to
the file. I mentioned above that redirection can also read from a file, in which case you
just need to switch the direction of the "arrow". For example:
$ wc -l < /tmp/nonroot_processes.txt
will read the file /tmp/nonroot_processes.txt and print the number of lines
found. (The wc command is the so-called "word count" command, and the -l switch is used to
modify its behavior to count lines instead.) The "grep", "head", "tail" and "cat", etc.,
commands could also have been used in this fashion. See for example:
$ cat </tmp/root_processes.txt
which is almost identical to
$ cat /tmp/root_processes.txt
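You can convince yourself of this on a small throwaway file (the /tmp/catdemo paths below are just for illustration): both forms print exactly the same bytes.

```shell
# Create a tiny demo file, then read it both ways.
printf 'one\ntwo\n' > /tmp/catdemo.txt
cat < /tmp/catdemo.txt > /tmp/catdemo.stdin_out   # cat reads standard input
cat /tmp/catdemo.txt   > /tmp/catdemo.arg_out     # cat opens the named file itself
cmp /tmp/catdemo.stdin_out /tmp/catdemo.arg_out && echo identical
```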
In fact, the difference is of academic importance only, and has to do
with how the file is opened and who owns the file handle. The output is identical. However, in
the case of wc, the output is not the same. GASP! How can wc report something different
depending on how the file was opened, after I just said that where the file is opened is of
academic importance only? The answer is that wc reports the same information, in that the
number of lines will be correct in both cases, e.g.:
$ wc -l
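A quick way to see both behaviors side by side is a throwaway three-line file (the path is chosen only for illustration):

```shell
printf 'a\nb\nc\n' > /tmp/wcdemo.txt
wc -l < /tmp/wcdemo.txt   # the count alone: wc never learns the file's name
wc -l /tmp/wcdemo.txt     # the count followed by the file name
```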
But in the first case wc does not print the file name, while in the second
it does. This is because the redirected input is read via a special file handle called
"standard input", whereas with files being opened explicitly by the program, the program is
automatically aware of the file name and its location on disk. And with that little
behavioral artifact, I have arrived at the concept central to all redirection and pipes
(which I will get to in a minute): Unix and Linux commands write their output to "standard
output", and read their input from "standard input". When you perform redirection, you literally
redirect this output to go somewhere else, or the reading to come from somewhere else. If you
run two commands in succession, both redirecting their output to the same file, the
output from the first command will be lost: the redirection overwrites the file.
In order to append more lines to an existing file, use the double redirection arrow
(>>), like this:
$ cat /tmp/nonroot_processes.txt >/tmp/parts
$ cat /tmp/root_processes.txt >>/tmp/parts
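The overwrite-versus-append behavior is easy to verify with a throwaway log file (the path is just for illustration):

```shell
printf 'first run\n'  > /tmp/append_demo.log   # '>' creates or truncates the file
printf 'second run\n' > /tmp/append_demo.log   # '>' again: the first line is gone
printf 'third run\n' >> /tmp/append_demo.log   # '>>' keeps what is already there
cat /tmp/append_demo.log                       # shows only the second and third lines
```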
This is especially useful in logging of events, i.e. when you do not want
information about old events to be lost when new events are recorded.

Next up: piping. It is possible to take the output from one command and redirect it into
another command running at the same time, without having to first store it in a file. This
is achieved by means of the "|" pipe symbol. Example:
$ ps -ef | more
The "more" command is made for situations where you
have more lines being displayed on the screen than what can fit in the terminal, causing the
lines to scroll past before you can read them. When a screen-full is reached, "more" pauses the scrolling of the output, and allows you to hit ENTER for one
more line, SPACE for another screen-full, or "q" to "quit" immediately. (Note: less is the
successor to more, and allows you to scroll backwards, up or down by half a screen-full, and,
most importantly to me, when you use it to search for words, it highlights them on the
screen.) The above method of "pausing" terminal output is used every day by Unix and Linux
administrators worldwide, especially with large files. Another example:
$ egrep -i "warn|err" /var/log/messages | less
The egrep command is an extended grep. In particular, it makes it easy to
specify multiple search patterns. The above command will show lines containing either the
string "warn" or "err", and the -i switch makes the search case-insensitive, so you will also
get ERR, Warn, etc.
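Reading /var/log/messages usually requires root, so here is the same kind of case-insensitive, multi-pattern match on a few synthetic lines (the file path and messages below are made up for the demo):

```shell
printf 'INFO all good\nWARN disk almost full\nErr: network timeout\n' > /tmp/egrep_demo.txt
egrep -i 'warn|err' /tmp/egrep_demo.txt   # matches the WARN and Err lines, skips INFO
```

(On recent systems egrep is simply a shorthand for grep -E, which does the same thing.)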
The pipe to less is used in case there is a lot of output. Running the above
command on your Linux machine every week or so will tell you a lot about its health as
basically every standard Linux and Unix component will log its messages to this file. It is
possible to make longer pipe chains. For example
$ ps -ef | grep root | less
Run the ps command with switches -e and -f; pipe its output into grep to keep only the
lines containing the string root; pipe this output into less to make it manageable
on the screen. A pipe really just redirects the output from one command into the input of
another command without having to first store it on disk. You can usually only have one input
and two output redirectors per command (There are some exceptions, but that is a somewhat
advanced topic). That is right - two output redirectors. Earlier you saw that with
redirection, the command had no output. But this command still produces output:
$ ls -l /etc/file799.xyz > /tmp/fileinfo.txt
/etc/file799.xyz: No such file or directory
Basically the output is a message warning you that the ls command
encountered an error situation. This error message is not printed on the standard output - In
Unix and Linux, error messages are printed on a special, dedicated file, called "standard
error". The below example uses all three redirections:
$ runreport </tmp/report_options >/tmp/main_report.txt 2>/tmp/report_errors.txt
Here the command "runreport" is some imaginary program. It will be reading
the lines in the /tmp/report_options file, which supposedly controls how it reports. The
report output will be stored in a file in /tmp/main_report.txt. If any errors are
encountered, these will be recorded in the file /tmp/report_errors.txt You will see a
2>for redirecting standard output. The >redirection of standard output has in fact got
an implied 1, for file handle number 1, which is "standard output". The three standard file
descriptors are:
0 - Standard Input or STDIN
1 - Standard Output or STDOUT
2 - Standard Error or STDERR
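A concrete way to watch descriptors 1 and 2 being split apart is ls with one file that exists and one that does not (this assumes /etc/hosts exists on your system; the demo paths are arbitrary):

```shell
# The found file is reported on stdout (fd 1); the error goes to stderr (fd 2),
# so the two redirections below land in separate files.
ls /etc/hosts /no/such/file >/tmp/fd_demo.out 2>/tmp/fd_demo.err || true  # ls exits non-zero here
cat /tmp/fd_demo.out   # contains /etc/hosts
cat /tmp/fd_demo.err   # contains the "No such file or directory" message
```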
Note that in the above examples, the file to which the standard output is
being directed does not capture the error messages. It is possible to create a single file
with both the normal output and any error messages interleaved, as it is often handy to see
where and when a job or process produced error messages. To do this, there is a way to
redirect the standard error into the standard output. (Read that again.) Example:
$ runreport </tmp/report_options >/tmp/main_report.txt 2>&1
The "&1" at the end of the above line means file handle 1, which is standard
output, and this is where the "2>" redirection sends standard error. This is getting
quite lengthy, but a final note on syntax. You can generally insert spaces anywhere
except inside a file name or command name, which has to be a single word. So the
following commands are equivalent:
$ ls -l /tmp >/tmp/fileinfo
$ ls -l /tmp>/tmp/fileinfo
In fact, you can put redirection anywhere on the command line, so the below
is a 3rd valid and equivalent variation:
$ >/tmp/fileinfo ls -l /tmp
However, you cannot break up the "2>&1" operator, or the
double-redirect ">>" used for appending output to existing files.
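>" used for appending">
You can check that all three placements produce identical results, using a small fixed directory so the listing does not change between runs (the /tmp/spacedemo paths are just for illustration):

```shell
mkdir -p /tmp/spacedemo/src
touch /tmp/spacedemo/src/a /tmp/spacedemo/src/b
ls /tmp/spacedemo/src >/tmp/spacedemo/out1      # space before '>'
ls /tmp/spacedemo/src>/tmp/spacedemo/out2      # no space: '>' still starts a redirection
>/tmp/spacedemo/out3 ls /tmp/spacedemo/src     # redirection written first
cmp /tmp/spacedemo/out1 /tmp/spacedemo/out2 \
  && cmp /tmp/spacedemo/out1 /tmp/spacedemo/out3 \
  && echo all identical
```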
That is redirection and pipes in a nutshell. Next up is command line
substitution.