Monday, May 28, 2007

When will MS-Windows start using the Linux Kernel

Many moons ago (in April 2005) I posted a little prediction that Microsoft would start selling a product based on the Linux Kernel. This has not happened yet, but lately I hear other people say the same thing quite often. I'll quote my original slashdot journal entry here for completeness' sake:
Hmmmm
... Microsoft is making money because their Windows operating systems are popular. While it is not the subject of this journal entry, I do want to briefly touch on why I think this is so: 1) Microsoft "allowed" us to copy and play with Windows and, as a result, grow used to and become familiar with it when we were young! 2) As a result, people expect to use Windows in the workplace, be it corporate or otherwise. 3) In a similar vein, Microsoft encourages game development with their free DirectX driver. This gains new followers and the cycle continues.

But: I do not think Microsoft will forever be able to continue down this road. In particular, I think the strength of Linux's underlying kernel is hurting Microsoft. But (on the above but) nothing stops Microsoft from building their own Linux derivative product (except maybe pride). Imagine running Linux with the full true MSFC built in, and the full MS Windows APIs available to programmers. Essentially a "Windows wrapper" around the Linux kernel, sold by Microsoft. Every game, productivity application, back-office program and specialist application runs on this powerful operating system unmodified, and so do Linux and X-Windows applications (due to the built-in X-server and MS-style windowing manager). Microsoft will be able to boast online kernel upgrades, device-driver upgrades, kernel parameter tuning, and have all the open source applications in the world running on their OS instantaneously... Also, all the benefits of the open source community's support carry over to this new version of MS Windows.

So, I see a day, not too far away, when this will be reality. In fact, sign me up as the first user of MS-Linux!

P.S. Before I get any flames - I am no expert on how an MSFC call is different from an API call, or how many layers of emulation would be required to make this Windows-on-Linux-kernel product a reality, but this is probably a moving target in any case.
It would only make sense. We have GNU/Linux and GNU/OpenSolaris, so why not eventually MS/Linux? Microsoft could save a packet in OS development cost, and all the Linux-vs-MS-Windows wars could stop. Back then I boldly said "sign me up as the first user", but lately the freedom of open source software has become much more important to me. I do not foresee that Microsoft will give away their MS/Linux product for free; in fact I am sure it would be a closed-source, proprietary affair. But then again, Sun surprised us, so why not Microsoft? Just like Sun did, I expect Microsoft would face a long and laborious legal battle over the licenses and patents of other included products - that is, if they were ever to open up their platform, so don't hold your breath.

Sunday, May 27, 2007

Tutorial: awk sees text files as rows and columns

Building on the regular expressions tutorial I posted yesterday, this introduction to awk will, if you have never used "awk" before, revolutionize your shell scripting experience. As I hinted in the title, the awk command "sees" its input as a collection of rows and columns. So the simplest awk command would be one which prints column 1 of a text file. Before I show an example I must mention two things about awk.
  • Awk uses the $ sign extensively in its internal language, so it is common practice to use single-quotes ('....') to protect awk statements from being interpreted by the shell.
  • Awk's default "field separator" is a collection of blanks, i.e. spaces and tabs (this default can be changed; see the -F example further down).
Keeping this in mind, if we look at the output from a command such as ls -l:
root@linwarg:/etc# ls -l
total 1996
drwxr-xr-x  8 root   root      4096 2007-04-16 10:03 acpi
-rw-r--r--  1 root   root      2077 2006-09-09 21:08 adduser.conf
-rw-r--r--  1 root   root        46 2007-05-26 15:24 adjtime
-rw-r--r--  1 root   root        50 2006-09-09 21:24 aliases
drwxr-xr-x  2 root   root      8192 2007-05-26 16:01 alternatives
-rw-r--r--  1 root   root       395 2007-03-05 08:38 anacrontab
drwxr-xr-x  7 root   root      4096 2007-04-16 10:05 apm
...
We see that the file size is always listed in the 5th column. So let's print just the sizes:
ls -l | awk '{print ($5)}'
This highlights some more aspects of the awk scripting language.
  • The awk language surrounds statements in curly-braces.
  • The "$5" reminds of the shell's default variable referring to the 5th command-line argument, but here, being between a pair of single-quotes, it is interpreted by awk, not by the shell. As is immediately evident, the above command prints only the 5th column of every row of input.
The awk language has got two special "address" prefixes that can be given to statements: the keywords "BEGIN" and "END". Prefixing one of these to a statement causes the statement to be executed only once, either immediately before reading and parsing the first line of input (BEGIN) or, in the case of END, once after reading and parsing the last line of input. Now consider this more complex example:
ls -l | awk '{TOTAL=TOTAL+$5} END {print (TOTAL)}'
The awk script consists of two separate statements. The second statement is prefixed by the END keyword. What this script does is, for each line of input encountered, add the value of the 5th field (implicitly converted to a numeric value) to the variable TOTAL. After all lines of input have been read, it prints the value of the TOTAL variable (implicitly converted to a text string). Note that awk initializes variables to zero/null values on first reference. The above example prints only a single line, showing the value of TOTAL obtained. In fact, all awk statements take addresses - the above examples both use a default (blank) address that matches all input lines. Addresses can be specified as simple regular expressions, or as complex conditions. A typical use would be to add up the 5th column only for files with some commonality in the name, e.g. to add up all the files with names ending in "txt", you can use:
ls -l | awk '/txt$/ {TOTAL=TOTAL+$5} END {print ("Total for txt files:", TOTAL)}'
The above line only executes the TOTAL=TOTAL+$5 statement when the expression /txt$/ is matched by the current input line. The expression is the "address", indicating when to execute the statement. Statements can be complex; for example, to print the list of files being added up, with the total at the end, we can add an extra step into the command like this:
ls -l | awk '/txt$/ {print;TOTAL+=$5} END {print ("Total for txt files:", TOTAL)}'
Note - I used the short form of the add instruction (+=) to keep the command from becoming too long. The print command used with no arguments, as in the above example, simply emits the complete input line. awk has also got a printf statement which can be used to great effect, for example to simply re-format the output of the ls -l command:
ls -l | awk '{printf ("%-30s %20d %s %3d %8s %8s %s %s\n", $8, $5, $1, $2, $3, $4, $6, $7)}'
There are (probably many) better ways to do this, but this example shows some of the mathematical capability of awk:
echo 5 3 | awk '{printf ("%5.3f\n", $1/$2)}'
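The above prints 1.667. Since only the END keyword has been demonstrated so far, here is a minimal sketch using BEGIN as well, to print a heading line before any input is processed (with the ls -l output shown earlier, $5 is the size and $8 the name):
ls -l | awk 'BEGIN {print ("SIZE NAME")} {print ($5, $8)}'
The BEGIN statement runs exactly once, before the first input line is read.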
awk has got a few built-in variables which are updated automatically. The NF variable, for example, reveals the number of fields on the current line. The NR variable contains the current input line (or record) number. Interestingly, the NF variable can be used in conjunction with a $, e.g. "$NF", to refer to the last field (or word) on a line, even when every line has got a different number of fields, e.g.:
ls -l | awk '{print ($NF)}'
This can be further advanced, for example to get the second-to-last field, one can use something like:
ls -l | awk '{print ($(NF-1))}'
Note the extra set of parentheses in the above example. The complexity of the awk language sometimes justifies placing awk scripts in their own separate files, allowing the awk statements to be formatted with indentation, etc. This is especially useful when you have many complex statements, and it eliminates the problems sometimes experienced with the shell expanding/interpreting special characters in the awk script. I will in the future do a follow-up article on awk, as this is only just barely scratching the surface of its capabilities.
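In the meantime, here is a minimal sketch of the separate-file approach mentioned above. Assuming the statements from the txt-total example are saved in a (hypothetical) file called totals.awk:
# totals.awk - print txt files and sum their sizes (5th column)
/txt$/ {
    print
    TOTAL += $5
}
END {
    print ("Total for txt files:", TOTAL)
}
It can then be run with the -f option, which tells awk to read its script from the named file:
ls -l | awk -f totals.awk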

Tutorial: grep and regular expressions

This little article will basically be just enough to get you working with regular expressions. Also I hope that it may serve as a nice introduction to the regular expressions family of man pages for those curious to learn more.

When parsing the messages files or output from strace (Linux) or truss (Solaris) it is often useful to use slightly more complex REs than when finding an entry in the hosts file. Even when checking for the existence of an entry in for example the hosts file from a script, you may want to avoid false matches.

The grep command is a handy, simple command which we can use to test REs. The same REs typically also work with commands like sed and awk and even to a lesser extent in the shell.

To test an RE (expression) we can simply use:

grep "expression" filename

Back to basics: The simplest RE is a character that stands for itself, and by joining together a bunch of these we get a series of characters that matches a specific string. The command "grep xyz /etc/hosts" will find (and print) all lines that contain the sequence "xyz" anywhere.

To match only entries starting or ending with a string, you can use one of these formats:

grep ^word filename
grep word$ filename

The characters ^ and $ are special regular expression anchors, but only when used as, respectively, the first and last item in the RE. The ^word expression matches lines STARTING with word, while word$ matches lines which END with word.
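
The two anchors can also be combined to match a whole line exactly. A minimal sketch (word and filename are just placeholders, as before):

grep "^word$" filename

This prints only the lines that consist of exactly the string word and nothing else.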

An easy way to find lines which contain both of two strings is by means of two grep commands, like this:

grep "abcmno" filename | grep "mnoxyz"

This would, however, also match lines that contain "abcmnoxyz", and it does not care about the order in which the two strings appear on the line. To force a check for both entries, with one specific expression appearing before the other, you can use some wild-cards. Example:

grep abcmno.*mnoxyz filename

Note the dot-star... The dot says "any character", and the * says "zero or more of them".

The dot wildcard on its own (i.e. without the * as above) will match any one character:

grep abcmno.mnoxyz filename

This is similar to using a "?" in shell file name expansion.

The next aspect of regular expressions is similar to the "dot" wildcard, but instead of matching any character, it matches a character from a list of possibilities. Example:

grep "abc[mno]xyz" filename

This matches any of the following

  • abcmxyz
  • abcnxyz
  • abcoxyz

The [...] is used to indicate a list of possibilities. The list can be quite large, and can contain some simple sequences, like [a-z] (for a through z) or [k-p] (for k through p). Another common possibility is [0-9] for any "numeric" character.

A typical use of this method is to make one character in a search case-insensitive, for example to find both "foo" and "Foo":

grep "[Ff]oo" filename

We can combine wildcards to search for lines which contain a single number and nothing else:

grep "^[0-9][0-9]*$" filename

  • The first [0-9] will match a numeric character in the first position on the line (note the preceding ^).
  • The second [0-9]* will match zero or more additional numerics, that is, besides the numeral matched by the first [0-9] occurrence.
  • The trailing $ means that after the last numeric, there must be an end-of-line.
  • Put differently, there cannot be anything other than numeric characters anywhere on the line.

Another use of the [...] is to exclude the listed characters. For example, to match any character which is NOT a numeric, you can use:

grep "[^0-9]$" filename

This example will find lines NOT ending in a numeric. The ^ as the first character inside the [...] set indicates that the "complement" of the [...] sequence is to be used.
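
The negated list can of course be combined with the anchors described earlier. For example, to find lines whose first character is anything other than a digit:

grep "^[^0-9]" filename

Note that the first ^ (outside the brackets) anchors the match to the start of the line, while the second ^ (inside the brackets) negates the character list.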

It is worth noting that in shell file-name expansion, an exclamation mark "!" can sometimes be used to indicate that the search string must be inverted. For example, listing all files starting with a single dot, but excluding the double-dot (parent-directory) entry, can be done like this:

ls .[!.]*

The above can be read as: list files with at least two characters in the name, where the first must be a dot and the second must be anything other than a dot. Note: Some shells support the use of either ^ or ! equally in this regard.

Tuesday, May 15, 2007

The silence is too loud

A lot is happening in my life right now, but I promise that I am working on a few articles, so hang in there, all three of my regular readers. Which reminds me... What exactly is "real life"? Why can't my online life be a valid part of my real life? I am of the opinion that phrases like "having a life" and "get a life" are irrelevant, because I need to interact with people and websites just as much as I need to eat and sleep.

Happiness is something that has got a lot of bearing on this. I don't watch much TV, couldn't care less about jock sports, and would rather "spend my life" in front of the computer than go to parties, sport matches and... whatever else passes for "having a life" these days. Oh, sure, I can appreciate the occasional movie. However, I just normally prefer to be in control of my entertainment. I like to steer my own path through the web as I read up on whatever topic I am busy investigating, and find watching TV and spectator sports too boring due to the unidirectional, uninvolved nature of it all. It feels like I'm being spammed with other people's lives!

Many people have written pieces on their school-days experiences with so-called sports jocks, and while I have had my share of these encounters, I will not waste space by just echoing those sentiments. One thing I do want to say, though, is that what many people don't realize, or have no tolerance for, is that there are different ways in which one can live a happy, fulfilling and satisfying life. Mine happens to use a tool called "The Computer" quite often. It is free of many of the elements of the world that I dislike, and those elements which it does contain are usually things I can ignore or avoid. At any rate, I promise not to let the blog deteriorate into yet another dust-collector.

Thursday, May 3, 2007

USB gadgets abound

Today I realized I should not gloss over two seemingly simple items, the first being pipes and redirection, and the other being substitution. This article will explain the first of these, showing those new to Linux and Unix how to link together simple commands to build powerful tools.

Before I start, a quick introduction to the Unix command-line syntax is appropriate.

  • The simplest Unix command is just a command word, which is the full path name of the file to be executed, starting with a slash representing the root. For example, if you enter /usr/bin/ls, the shell will run the ls program found in the /usr/bin directory.
  • Similarly if you enter a command without specifying the directory, the shell will search all the directories specified in the PATH variable, in the order they are listed, and will execute the first program it finds with the name specified.
  • Slightly more complex commands take options and/or arguments. These are extra "words" which modify the behavior of the program. For example the ps program lists processes associated with your terminal session, but if you add the "-e" option, it will list all processes running on the system.
  • Simple options which turn on a different behavior in the command are called switches, and these can be stacked. For example the -f switch, which causes ps to include additional details about each listed process, and the -e switch mentioned above, can be combined like this:
$ ps -ef
    

& to produce a listing of "every process" with "full defaults" A note on the formatting in this article: When a line is preceded with a "$", it means that it is a command to be entered at the shell prompt, though the $ itself must not be entered. The usual shell prompt is a $ for normal users, or a # for root, though your system may well show additional details in the prompt string. Lines in blue text and that does not start with a "$" is the output generated by the command. There is more to commands, but this is enough to be able to understand the rest of this tutorial, so without further delay on to the real subject: Redirection and Pipes. Redirection allows you to "store" the output from a command directly into a file on disk, or to read lines from a file on disk.

$ ps -ef >/tmp/processlist.txt
    

If you run the command above you get... nothing, and you get it fast. That's right. The ps -ef command on its own produces the expected output, but the ">/tmp/processlist.txt" portion causes this output to go into the file /tmp/processlist.txt instead of appearing in your terminal. You can examine the newly created file with the ls command (which "lists" information about files and/or directories):

$ ls -l /tmp/processlist.txt 
-rw-r--r-- 1 user1 user1 23100 Jun 4  10:06 /tmp/processlist.txt
    

And you can actually view its contents with any one of the "cat", "less" or "more" commands, like this:

$ cat /tmp/processlist.txt
    

Now that you have this file on the disk you can use it over and over. Enter the above cat command again, and it shows the file contents again. Much more interesting, though, is to filter out specific lines from the file. The head command prints the first 10 lines from the specified file.

$ head /tmp/processlist.txt 
$ tail -4 /tmp/processlist.txt
    

And you guessed it, tail prints the last 10 lines. Specifying a switch with head or tail, like the "-4" above, lets you control the number of lines being displayed if you want something other than the default 10. Much more interesting than "head" or "tail" is grep, which I used extensively in a previous tutorial.

The "grep" command filters lines out based on a special rule for searching, called a "regular expression". These can be quite complex, but does not always need to be, e.g:

$ grep root /tmp/processlist.txt
    

The above command reads the specified file and prints only the lines that contain the string "root". You can also get grep to do the inverse, like this:

$ grep -v root /tmp/processlist.txt
    

Note the "-v" switch which modified grep's behavior to print lines not matching the expression. In the above examples, "grep", "head", "tail" and "cat" were expressly told what file to open and search through. The /tmp/processlist.txt filename specified is "an argument" of each command. Combining grep searches with redirection allows us to create more stored files for further parsing.

$ grep root /tmp/processlist.txt >/tmp/root_processes.txt 
$ grep -v root /tmp/processlist.txt >/tmp/nonroot_processes.txt
     

This then creates two files with complementary information - the first, /tmp/root_processes.txt, with all the lines containing the string root, and the other file with all the lines that do not contain this string. In the redirection examples above the "greater than sign" (>) was used like an arrow pointing from the command ... to the file. I mentioned above that redirection can also read from a file, in which case you just need to switch the direction of the "arrow". For example:

$ wc -l < /tmp/nonroot_processes.txt
    

This will read the file /tmp/nonroot_processes.txt and print the number of lines found. (The wc command is the so-called "word count" command, and the -l switch is used to modify its behavior to count lines instead.) The "grep", "head", "tail" and "cat", etc. commands could also have been used in this fashion. See for example:

$ cat </tmp/root_processes.txt
      

which is almost identical to

$ cat /tmp/root_processes.txt
      

In fact, the difference is of academic importance only, and has got to do with how the file is opened and who owns the file handle. The output is identical. However in the case of wc, the output is not the same. GASP! How can wc report something different depending on how the file was opened, after I just said that how the file is opened is of academic importance only? The answer is that wc reports the same information in that the number of lines will be correct in both cases, e.g.:

$ wc -l </tmp/root_processes.txt
$ wc -l /tmp/root_processes.txt
      

But in the first case wc does not print the file name, while in the second it does. This is because the redirected input is read via a special file handle called "standard input", whereas when a file is opened explicitly by the program, the program is automatically aware of the file name and its location on disk. And with that little behavioral artifact, I have arrived at the concept central to all redirection and pipes (which I will get to in a minute).

Unix and Linux commands write their output to "standard output", and read input from "standard input". When you perform redirection, you literally redirect this output to go somewhere else or the reading to come from somewhere else. If you run two commands in succession, both redirecting their output to the same file, the output from the first command will be lost. The redirection will overwrite the file. In order to append more lines to an existing file, use the double-redirection arrows (>>), like this:

$ cat /tmp/nonroot_processes.txt >/tmp/parts
$ cat /tmp/root_processes.txt >>/tmp/parts
      

This is especially useful for logging of events, i.e. when you do not want information about old events to be lost when new events are recorded.

Next up: piping. It is possible to take the output from one command and redirect it into another command running at the same time, without having to first store it in a file. This is achieved by means of the "|" pipe symbol. Example:

$ ps -ef | more
      

The "more" command is made for situations where you have more lines being displayed on the screen than what can fit in the terminal, causing the lines to scroll past before you can read them. When a screen-full is reached, "more" pauses the scrolling of the output, and allows you to hit ENTER for one more line, SPACE for another screen-full, or "q" to "quit" immediately. (Note; less is the successor to more, and allows you to scroll backwards, up or down by half a screen-full, and most importantly to me, when you use it to search for words, it highlights them on the screen) The above method of "pausing" terminal output is used every day by Unix and Linux administrators world wide, especially with large files. Another example:

$ egrep -i "warn|err" /var/log/messages | less
      

The egrep command is an extended grep. In particular, it makes it easy to specify multiple search strings. The above command will show lines with either the string "warn" or "err", and the -i switch makes the search case insensitive, therefore you will also get ERR, Warn, etc.

The pipe to less is used in case there is a lot of output. Running the above command on your Linux machine every week or so will tell you a lot about its health, as basically every standard Linux and Unix component logs its messages to this file. It is possible to make longer pipe chains. For example:

$ ps -ef | grep root | less
      

Run the ps command with switches -e and -f; pipe its output into grep and filter out lines containing the string root; pipe this output into less to make it manageable on the screen. A pipe really just redirects the output from one command into the input of another command without having to first store it on disk. You can usually only have one input and two output redirectors per command (there are some exceptions, but that is a somewhat advanced topic). That is right - two output redirectors. Earlier you saw that with redirection, the command had no output. But this command still produces output:

$ ls -l /etc/file799.xyz > /tmp/fileinfo.txt
/etc/file799.xyz: No such file or directory
    

Basically the output is a message warning you that the ls command encountered an error situation. This error message is not printed on the standard output - In Unix and Linux, error messages are printed on a special, dedicated file, called "standard error". The below example uses all three redirections:

$ runreport </tmp/report_options >/tmp/main_report.txt 2>/tmp/report_errors.txt
      

Here the command "runreport" is some imaginary program. It will be reading the lines in the /tmp/report_options file, which supposedly controls how it reports. The report output will be stored in a file in /tmp/main_report.txt. If any errors are encountered, these will be recorded in the file /tmp/report_errors.txt You will see a 2>for redirecting standard output. The >redirection of standard output has in fact got an implied 1, for file handle number 1, which is "standard output". The three standard file descriptors are:

0 - Standard Input or STDIN
1 - Standard Output or STDOUT
2 - Standard Error or STDERR
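
As an aside, a very common use of the standard error descriptor is to discard unwanted error messages by redirecting them to /dev/null, for example reusing the deliberately nonexistent file from above:

$ ls -l /etc/file799.xyz 2>/dev/null

The error message now disappears, while any normal output would still be shown on the terminal.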

Note that in the above examples, the file to which the standard output is being directed does not capture the error messages. It is possible to create a single file with both the normal output and any error messages interleaved, as it is often handy to see where and when a job or process produced error messages. To do this, there is a way to redirect the standard error into the standard output. (Read that again). Example

$ runreport </tmp/report_options >/tmp/main_report.txt 2>&1
      

The "&1" at the end of the above line means standard output, and this will receive "2> which is standard error. This is getting quite lengthy, but a final note on syntax. You can generally insert spaces in anywhere except inside a file name or command name, which has to be a single word. So the following commands are equivalent:

$ ls -l /tmp >/tmp/fileinfo 
$ ls -l /tmp>/tmp/fileinfo
      

In fact, you can put redirection anywhere on the command line, so the below is a 3rd valid and equivalent variation:

$ >/tmp/fileinfo ls -l /tmp
      

However you cannot break up the "2>&1" operator, or the double-redirect ">>" used for appending output to existing files. That is redirection and pipes in a nutshell. Next up is command line substitution.