tag:blogger.com,1999:blog-36434045173770308152024-03-13T01:06:47.802+02:00Initial Program LoadA mostly technical, sometimes even sensible blog about whatever piques my attention, featuring computers, life, the universe, and everything.Anonymoushttp://www.blogger.com/profile/03492411141762009309noreply@blogger.comBlogger51125tag:blogger.com,1999:blog-3643404517377030815.post-55823029731484293822015-08-27T17:38:00.001+02:002015-08-27T17:38:22.379+02:00Why free software matters - even if you don't use it.<div style="text-align: justify;">
By “Free” software here I mean as in the <a href="http://www.webopedia.com/TERM/F/FOSS.html" target="_blank">FOSS (Free and Open Source Software)</a> sense – software that truly does not infringe on your rights.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
In fact the $0 price tag of free software is the least of its benefits. While you may download and install it to use for both private and business purposes, the real benefit is the freedom that comes with open-source software. Open-source software puts you under no obligations. I won't go into the definition of free software any further here, but the Four Freedoms of free software are <a href="http://www.gnu.org/philosophy/free-sw.en.html" target="_blank">canonically defined here: http://www.gnu.org/philosophy/free-sw.en.html</a> </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Note that I did not say ‘no obligation to pay’. No obligation whatsoever. Not even to keep on using it, which may raise an eyebrow here and there.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
One of the ways in which commercially licensed software (and some free software that isn't actually Open) catches you is by preventing you from switching. Software cost-of-ownership calculations, when done properly, include not only the entry and running cost, but also the exit cost. When the software makes it difficult or very expensive to switch to another product, you are basically locked in. If for example your word processor produces documents in a format that cannot be used by other, competing products, then you are caught unless you can find a way to convert your documents to a portable format, a process which may be difficult and costly. Once you are so caught, the vendor can do some nasty things, like force you to store your documents on their servers, to pay a subscription, and very importantly, force you to upgrade.</div>
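When you do need to escape such a format trap, batch conversion tools can help. For example, LibreOffice can convert documents to the portable OpenDocument format from the command line. A minimal sketch, assuming LibreOffice is installed; the directory names are hypothetical:

```shell
# Convert every .doc file in docs/ to the OpenDocument format,
# writing the results to converted/, without opening the GUI.
soffice --headless --convert-to odt --outdir converted/ docs/*.doc
```

The same invocation works for other targets (e.g. --convert-to pdf), which makes it suitable for scripting over a whole document archive.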
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Another important consideration is your assurance that the software will not compromise your privacy or security. When you cannot examine the code of your programs, you fall back upon trust. And who of us really can examine it? Even when you have the skill, commercial software vendors will not give you the source code; and even with the code in hand, auditing a single program like a word processor can keep one person occupied for a lifetime, while an operating system can occupy teams of people forever. So when you buy a program you implicitly trust it, and by extension, the vendor who created it.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Should you trust this vendor? Can you trust them? You cannot read the minds of every employee at the company, but you can rest assured that some of them do not have your best interests at heart, and that the company itself is in it to get as much of your money as possible. Competition is a good thing though - it keeps software creators on their toes.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Upgrades are not always a good thing, especially when forced down your throat. The new version may include features that compromise your privacy or security, and the vendor may change their license agreement wording and even their licensing model!</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The one thing you can use as a guide to judge whether you can trust a software vendor is its reputation. Of course this is far from foolproof, but it is an important aspect nonetheless. I will come back to this.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Truly open software guarantees that your freedom will never be compromised. This is because open-source software is based in a community, not in a company. Even if you don't actively read through the source code, many other people in the community will: reading it, changing it, and submitting enhancements and fixes. Some examples:</div>
<br />
<ul>
<li style="text-align: justify;">The Gimp (a photo editing application) has had more than <a href="https://www.openhub.net/p/gimp" target="_blank">35,000 commits by 535 contributors</a>. </li>
<li style="text-align: justify;">Krita (an illustration program) has had more than <a href="https://www.openhub.net/p/krita" target="_blank">97,000 commits by 577 contributors</a>. </li>
<li style="text-align: justify;">Kate (an advanced text editor suitable for programming) has had <a href="https://www.openhub.net/p/kate" target="_blank">14,900 commits by 367 contributors</a>. </li>
<li style="text-align: justify;">The Linux Kernel has had <a href="https://www.openhub.net/p/linux" target="_blank">598,000 commits by 14,500 contributors</a>. </li>
<li style="text-align: justify;">KDElibs, the basis of the KDE desktop environment, has had <a href="https://www.openhub.net/p/kdelibs" target="_blank">101,000 commits by 1151 contributors</a>.</li>
<li style="text-align: justify;">LibreOffice (a productivity suite) has had some <a href="https://www.openhub.net/p/libreoffice" target="_blank">43,800 commits by 518 contributors</a>.</li>
</ul>
<br />
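The statistics above come from OpenHub, but you can derive similar numbers for any project you have a clone of, using plain git. A minimal sketch; the repository and author names below are made up for demonstration:

```shell
# Create a throw-away repository with two commits by two authors,
# then count commits and distinct contributors.
cd "$(mktemp -d)"
git init -q demo
cd demo
git -c user.name=alice -c user.email=alice@example.com \
    commit -q --allow-empty -m "first commit"
git -c user.name=bob -c user.email=bob@example.com \
    commit -q --allow-empty -m "second commit"
git rev-list --count HEAD      # total commits on this branch (2 here)
git shortlog -sn HEAD | wc -l  # distinct authors (2 here)
```

Run against a real clone (e.g. of the Linux kernel), the last two commands reproduce the kind of commit and contributor counts quoted above.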
<div style="text-align: justify;">
It is easy to see that a lot of people are working hard to maintain and improve free software.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Software makers still need to guard their reputation - if they compromise their users, these users will start to look elsewhere to spend their money. But when there is no competition they can do as they please. Free software provides a real alternative to commercial software and so gives us options. Therefore even if you don't use the free, open-source alternatives available, the mere existence of these options helps to protect your rights.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
There are a few things you can do to help keep it that way. Use open-source software, maybe even contribute to an open-source project. Or simply donate to an open-source software project, such as the upcoming <a href="https://www.kde.org/fundraisers/kdesprints2015/" target="_blank">2015 Randa KDE sprint - https://www.kde.org/fundraisers/kdesprints2015/</a></div>
<div style="text-align: justify;">
<br /></div>
Anonymoushttp://www.blogger.com/profile/03492411141762009309noreply@blogger.com0tag:blogger.com,1999:blog-3643404517377030815.post-65248733459018237552015-05-21T14:28:00.001+02:002015-05-21T14:28:24.135+02:00Protecting against the Logjam vulnerability using AWS Elastic Load BalancerIf you use Amazon's ELB to handle your TLS (I highly recommend this) you can protect your users and yourself against Logjam by changing a single setting in the AWS console.<br />
<br />
<b><u>Instructions with screenshots:</u></b><br />
<br />
<ol>
<li>Open the EC2 dashboard in the Amazon Web Services Console.</li>
<li>On the left under Network Security click on "Load Balancers"<br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwEdJwv7FZ1_EO482lTzSC24bsA1a4AMf7FQYs_3qn53PEb4WcgnjlEfJHIYSPRVdj2RIOs3RV2fYCfn5DHvfKG0Ts5hV4jNKMzor1DC9P-fQquyVv6sdI71Qu4G6tBpcQChdXmirtEtM/s1600/logjam1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" height="173" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwEdJwv7FZ1_EO482lTzSC24bsA1a4AMf7FQYs_3qn53PEb4WcgnjlEfJHIYSPRVdj2RIOs3RV2fYCfn5DHvfKG0Ts5hV4jNKMzor1DC9P-fQquyVv6sdI71Qu4G6tBpcQChdXmirtEtM/s320/logjam1.png" width="320" /></a></li>
<li>Select the Load Balancer instance and click on "Change" in the "Cipher" column.</li>
<li>The "Select a Cipher" window will pop up (despite the name, you can select multiple ciphers)<br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjzUKrFzklNPlEIL8gtCiegd_lwAiGyqguq3zvyvCaAoKTMAAycovMsSBMmcPY40oPFyqUlvLiLbkPJjv24K3gAIfLcdJkvkFMbi-C6_-TjanzWZIovcsP4Bt8ZgtiynTbTgcPvTgeJ5VY/s1600/logjam2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" height="220" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjzUKrFzklNPlEIL8gtCiegd_lwAiGyqguq3zvyvCaAoKTMAAycovMsSBMmcPY40oPFyqUlvLiLbkPJjv24K3gAIfLcdJkvkFMbi-C6_-TjanzWZIovcsP4Bt8ZgtiynTbTgcPvTgeJ5VY/s320/logjam2.png" width="320" /></a></li>
<li>Select "Predefined Security Policy" and then select the new item named "ELBSecurityPolicy-2015-05"</li>
<li>Select Save and close the window.</li>
</ol>
<br />
<br />
The change is effective immediately and without interruption. Test access to your site using all the browsers that you care about. Then repeat steps 3 to 6 for any other Load Balancer Instances you have.<br />
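If you manage many load balancers, the same change can be scripted with the AWS CLI instead of clicking through the console. A sketch under stated assumptions: the AWS CLI is configured with suitable credentials, the classic load balancer name my-load-balancer and the policy name logjam-fix are hypothetical, and the HTTPS listener is on port 443:

```shell
# Create a policy that references the predefined ELBSecurityPolicy-2015-05
# cipher set, then attach it to the HTTPS listener on port 443.
aws elb create-load-balancer-policy \
    --load-balancer-name my-load-balancer \
    --policy-name logjam-fix \
    --policy-type-name SSLNegotiationPolicyType \
    --policy-attributes AttributeName=Reference-Security-Policy,AttributeValue=ELBSecurityPolicy-2015-05
aws elb set-load-balancer-policies-of-listener \
    --load-balancer-name my-load-balancer \
    --load-balancer-port 443 \
    --policy-names logjam-fix
```

Loop this over the output of aws elb describe-load-balancers to cover a whole fleet.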
<br />
Also try the "Server Test" function against your site URL from the <a href="https://weakdh.org/sysadmin.html" target="_blank">PFS Deployment Guide</a><br />
<br />
<i>This will take only 10 seconds and it offers your users an important level of protection. Don't imagine that you are not a target.</i>Anonymoushttp://www.blogger.com/profile/03492411141762009309noreply@blogger.com1tag:blogger.com,1999:blog-3643404517377030815.post-83357260149723481482014-08-19T10:41:00.000+02:002014-08-19T10:41:17.397+02:00Die Konqueror DieWhat is the benefit of having a browser that doesn't work? I would argue that it actually detracts from the experience of using the environment.<br />
<br />
KDE does not gain or lose followers for having or not having a native browser. Rendering real-world web sites well is the only criterion that matters, and arguably the only browsers that meet it are Firefox and Chrome.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMFyBezipqI5SGC5epzYSlduC60VzB68c048QFOJLo1ykLJMdymU-duuW99149-CKWPt1b9hmpBaqtYS4coZ9-x1s7AaOtzhcs_lYqeKOesDvw9oWnv_DPD0d1Al2ryER5fgOUiT7Wbk8/s1600/Selection_354.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhMFyBezipqI5SGC5epzYSlduC60VzB68c048QFOJLo1ykLJMdymU-duuW99149-CKWPt1b9hmpBaqtYS4coZ9-x1s7AaOtzhcs_lYqeKOesDvw9oWnv_DPD0d1Al2ryER5fgOUiT7Wbk8/s1600/Selection_354.png" height="194" width="320" /></a></div>
<br />
I'm all for competition, but now that <a href="https://blogs.kde.org/2014/08/16/konqueror-looking-maintainer" target="_blank">Konqueror is looking for a maintainer</a> is as good a time as any to shelve the product, write it off and re-allocate those resources to making Chrome and Firefox integrate better with KDE. That would arguably gain KDE more users and make existing ones happier - it is a cause with more value.<br />
<br />
P.S. That the web browser also functions as a document viewer and file manager is irrelevant. Nobody uses it for those functions for more than five minutes after discovering one of the better, dedicated programs, if they ever did.<br />
<br />
P.P.S. I'm not even going to mention the other dysfunctional KDE web browser.Anonymoushttp://www.blogger.com/profile/03492411141762009309noreply@blogger.com0tag:blogger.com,1999:blog-3643404517377030815.post-16004140995524855472014-01-07T14:13:00.002+02:002014-01-07T14:13:24.396+02:00How to update your CNTLM password<p ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><font SIZE=3><i>CNTLM is awesome for enabling the use of Linux in a Microsoft-dominated workplace, in particular for getting onto the Internet when you need to authenticate with Microsoft Domain credentials.</I></FONT></P><p ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><a style="color: orange" HREF="http://cntlm.sourceforge.net/">CNTLM</A> runs on your Linux system as a small proxy server. It receives requests for connections to web-based services and adds the necessary Microsoft authentication metadata to the outgoing packets before forwarding them on to the upstream "corporate" proxy servers.</P><p ALIGN=JUSTIFY STYLE="margin-bottom: 0in">For this CNTLM stores your credentials in a text file, usually /etc/cntlm.conf ... This file is checked on start-up.</P><p ALIGN=JUSTIFY STYLE="margin-bottom: 0in">When your Domain password changes, you need to "inform" CNTLM of the new password to use to get past the corporate proxy system. This is done by updating the cntlm.conf file and restarting the CNTLM service.</P><p ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Because MS domain and authentication data often "takes a while to propagate to all services", I recommend the following overall process:<br />
<ol><li>Prevent any programs/devices from using old passwords (otherwise these may get you locked out of the network)<br />
</li>
<li>Change your Domain password. Write it down (in a safe place).<br />
</li>
<li>Go have a cup of coffee or do whatever you like, give the network "a while" (20 minutes) to propagate your new password.<br />
</li>
<li>Log off and back on, access the Internet via MS Internet Explorer, access your web-based Exchange, etc. - whatever you find convenient to make sure that your password is updated throughout the network.<br />
</li>
<li>Follow the steps below to update your CNTLM password.<br />
</li>
<li>Re-enable programs and devices with your updated password. This may include your Exchange account on your smart phone, proxy settings in your Linux package manager etc.<br />
</li>
</ol><p ALIGN=JUSTIFY STYLE="margin-bottom: 0in">The CNTLM configuration file stores the following authentication details:<br />
<ol><li>MS domain Name<br />
</li>
<li>MS domain user name<br />
</li>
<li>MS domain authentication type (Usually NTLMv2)<br />
</li>
<li>MS domain user password or a Hashed version of the password<br />
</li>
</ol></P><p ALIGN=JUSTIFY STYLE="margin-bottom: 0in">It is recommended to use a hashed version of the password (instead of the actual password) in the configuration file. CNTLM includes a way of generating the updated hash so that you do not need to store the password in plain text, which adds a layer of security over and above the fact that the file is readable only by the superuser.</P><p ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Start by preventing any applications from using the proxy - I use <a style="color: orange" HREF="https://github.com/Tahaan/proxymanager">ProxyManager</A> to disable the Proxy settings everywhere, ensuring that nothing will try to connect while the update is in progress.<br />
<table WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0><col WIDTH=256*>
<tr><td WIDTH=100% VALIGN=TOP><font FACE="Courier New, monospace" SIZE=3><pre>johan@Komputer:~$ <b>p-off</B>
Disabling for KDE global
Disabling for S3cmd
Disabling for Dropbox
Disabling for VirtualBox
Disabling for Git global
Disabling for Wine IE
Disabling for Curl
Disabling for wget
Disabling for APT
Disabling for Root Bash
Disabling proxy for Root Curl
</PRE></FONT></TD></TR>
</TABLE></P><p ALIGN=JUSTIFY STYLE="margin-bottom: 0in">CNTLM must be running when you update the password - start it if it is not.<br />
<table WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0><col WIDTH=256*>
<tr><td WIDTH=100% VALIGN=TOP><font FACE="Courier New, monospace" SIZE=3><pre>johan@Komputer:~$ <b>ps -ef|grep cntlm</B>
cntlm 2102 1 0 Jan04 ? 00:00:05 /usr/sbin/cntlm -U cntlm -P /var/run/cntlm/cntlm.pid
johan 31452 30162 0 10:02 pts/2 00:00:00 grep --color=auto cntlm
</PRE></FONT></TD></TR>
</TABLE></P><p ALIGN=JUSTIFY STYLE="margin-bottom: 0in">You need to "be root" to update the CNTLM configuration<br />
<table WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0><col WIDTH=256*>
<tr><td WIDTH=100% VALIGN=TOP><font FACE="Courier New, monospace" SIZE=3><pre>johan@Komputer:~$ <b>sudo -s</B>
root@Komputer:~# <b>cntlm -IM http://test.com</B>
Password:
Config profile 1/4... OK (HTTP code: 302)
----------------------------[ Profile 0 ]------
Auth NTLMv2
PassNTLMv2 FEDCBA9876543210CC747CDB22103C1D
------------------------------------------------
</PRE></FONT></TD></TR>
</TABLE></P><p ALIGN=JUSTIFY STYLE="margin-bottom: 0in">What happens is that CNTLM prompts for the new password and uses the Domain and User details from the config file to connect to the test URL provided. It tries all the known authentication methods, and when a working one is found it displays the method and the corresponding password hash.</P><p>Use a text editor to update the configuration file with the displayed details. Save the file, then restart CNTLM so that it reads the updated hash from the configuration file.<br />
<table WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0><tr><td WIDTH=100% VALIGN=TOP><font FACE="Courier New, monospace" SIZE=3><pre>root@Komputer:~# <b>/etc/init.d/cntlm restart</B>
Stopping CNTLM Authentication Proxy: cntlm.
Starting CNTLM Authentication Proxy: cntlm.
root@Komputer:~# <b>exit</B>
</PRE></FONT></TD></TR>
</TABLE></P><p ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Finally re-enable the Proxy in all applications.<br />
<table WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0><col WIDTH=256*>
<tr><td WIDTH=100% VALIGN=TOP><font FACE="Courier New, monospace" SIZE=3><pre>johan@Komputer:~$ <b>p-on</B>
Enabling for APT
Enabling for Root Bash
Enabling for Root Curl (eg for Yast)
Enabling for KDE global
Enabling for wget
Enabling for Curl
Enabling for S3cmd
Enabling for VirtualBox
Enabling for Git global
Enabling for Wine IE
Enabling for Dropbox
</PRE></FONT></TD></TR>
</TABLE></P><p ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><font SIZE=3><i>Now if only there were a way to change the password for CNTLM, Contacts sync on my Android phone, Calendar Sync on my Tablet, MS Lync client on my phone, AND on the Microsoft domain, all at once.<br />
</I></FONT></P>Anonymoushttp://www.blogger.com/profile/03492411141762009309noreply@blogger.com0tag:blogger.com,1999:blog-3643404517377030815.post-15481298764638723502013-12-17T14:07:00.000+02:002013-12-17T14:07:08.798+02:00The year of the Linux Desktop rode out on a Unicorn<span style="font-family: Verdana,sans-serif;"><span style="font-size: x-small;"><i>The year of the Linux desktop is a myth. It will never happen. It has not happened. And it isn't in progress either.</i></span></span><br />
<span style="font-family: Verdana,sans-serif;"><br /></span>
<span style="font-family: Verdana,sans-serif;">Oh, make no mistake. Chrome OS is not even the second-to-latest entrant. Right after Canonical's Edge Ubuntu Phone came <a href="http://store.steampowered.com/livingroom/SteamOS/" target="_blank">the Free Forever Steam OS</a>.</span><br />
<span style="font-family: Verdana,sans-serif;"><br /></span>
<span style="font-family: Verdana,sans-serif;">Of course Canonical is a little evil: they want to make money off of Linux. We used to hate Red Hat for that; we turned our backs on them, and they went ahead and captured the Internet server market regardless. I can't help it, <a href="https://www.redhat.com/wapps/training/certification/verify.html?certNumber=130-015-340&isSearch=False&verify=Verify" target="_blank">even yours truly is Red Hat certified</a>.</span><br />
<span style="font-family: Verdana,sans-serif;"><br /></span>
<span style="font-family: Verdana,sans-serif;">Canonical arguably single-handedly put Linux on the Desktop market. It is much, much easier to install software on Linux than it is to do so on Windows. If you are not already used to Windows' convoluted way of doing things, then finding your way around a Gnome or KDE desktop presents a gentler learning curve. These days even printing and scanning work better on Linux, and PDFs just stop being impenetrable, immutable solid files.</span><br />
<br />
<span style="font-family: Verdana,sans-serif;">Case in point: When the Windows "Fix Network Connection" function doesn't work, go ahead and try to figure out why. When your (wife's) Windows system (again) doesn't want to print, go ahead and try to figure out why. After all, she didn't change anything and it worked yesterday.</span><br />
<span style="font-family: Verdana,sans-serif;"><br /></span>
<span style="font-family: Verdana,sans-serif;">Kubuntu is beautiful without the new-fangled desktop paradigm that really doesn't belong on a non-touch-based system. Linux is beautiful. Ubuntu made Linux user-friendly.</span><br />
<span style="font-family: Verdana,sans-serif;"><br /></span>
<span style="font-family: Verdana,sans-serif;">But we cannot give Canonical all the credit. Google took a page out of Microsoft's book and gave us Android with games. Yes, when 1989's Windows gamers grew up and went to work in corporates they did not expect OS/2 on their office PCs, they wanted Windows. Because Microsoft neglected to prosecute them in the nineties and let them play games when they were teenagers.</span><br />
<span style="font-family: Verdana,sans-serif;"><br /></span>
<span style="font-family: Verdana,sans-serif;">And yes, Google is also a little bit evil (a.k.a. trying to make money out of us, trying to prevent us from going anywhere else with our data, spying on our search and email and buying habits)... but in the meantime they are building out on the Linux base, and I bow to them for that. More users = more justification for big players (AMD, Intel, nVidia) to support Linux. Not that there are any real alternative options, but still.</span><br />
<span style="font-family: Verdana,sans-serif;"><br /></span>
<span style="font-family: Verdana,sans-serif;">I must say I honestly do not care whether my device drivers are open source or not. For all I care the driver can secretly use my GPU to generate Bitcoins for AMD whenever my PC is idle. All I care about is that my device driver works well, supports all the hardware features, works on Linux, is supported and updated, and is included in the cost of the hardware. As if I ever read the code to make sure there are no backdoors.</span><br />
<span style="font-family: Verdana,sans-serif;"><br /></span>
<span style="font-family: Verdana,sans-serif;">Linux is so prevalent these days that it is becoming nearly a household name. I do blame Google a little for not making it more obvious that Android is based on Linux, but that is just PR - Linux has a stigma attached to it that it is not for the average Joe, which Android is.</span><br />
<span style="font-family: Verdana,sans-serif;"><br /></span>
<span style="font-family: Verdana,sans-serif;">Which reminds us of Java. Can anybody remember who created Java? Up until 3 years ago we all had Java. It was something on any phone that could download and run apps. It ran web-based games and it did everything in between. Those with very keen eyesight might have noticed the minutely small Sun Microsystems logo in the bottom right corner of the Java web page. I call it a glaring, stupid failure to capitalize on an opportunity to market. Sun Microsystems were in everybody's homes, but IBM were buying full-page ads in computer and gaming magazines. Everybody knows IBM = Computers, but sadly Sun Microsystems, the original graphical workstation makers, are now little more than a memory for many, and essentially never were known outside of the core industry.</span><br />
<span style="font-family: Verdana,sans-serif;"><br /></span>
<span style="font-family: Verdana,sans-serif;"><span style="font-size: x-small;"><i>The Desktop came and went without Linux making it.</i></span></span>Anonymoushttp://www.blogger.com/profile/03492411141762009309noreply@blogger.com0tag:blogger.com,1999:blog-3643404517377030815.post-62313569593518228972012-12-13T12:53:00.002+02:002012-12-13T12:53:34.668+02:00Why hard drives are smaller than expected<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=3><I>People often ask why their Terabyte hard drive isn't a terabyte and time and again the simple, not necessarily false, answer given is that it is a marketing ploy by the evil manufacturers. But there is another answer.<br />
</I></FONT></P><DIV ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><P>In the good old days there were only the SI prefixes - a thousand meters in a kilometer, a thousand grams in a kilogram, and that is how we like it. Engineers and scientists insist that it be so - well, mostly they do. The SI standards organisation defines the prefixes this way.</P><P>The binary nature of digital computers lends itself to working with powers of two for units. The problem comes in with how close 1000 happens to be to the value 2<sup>10</sup> - the difference was considered negligible while designing computers and writing early computer system manuals. The habit stuck, and the prefix "Kilo" in computer terms became interchangeable with the value 1024. It would be wasteful to use 1000 as the demarcation for many computer allocation units, because powers of two align well and make for more efficient and cost-effective designs.</P><P>This limitation is however specific to situations where bits are processed, stored or transferred in parallel. This includes processors, memory banks and system busses. Serial media, such as communication lines, networks, and hard drives, do not suffer from this limitation. (It must be noted that while it is convenient to think of data stored on a hard drive as parallel bits, natively hard drives, just like tape devices, read and write bits in serial.)</P><p>A "kilo"-byte turns out to be a convenient measure for a quantity of data. The difference also appears negligible at first glance, and using it this way feels comfortable to humans. Note however that the margin of error increases as we move to higher-order numbers.</p><PRE>Prefix   Order   Binary prefix value        Decimal prefix value       Deviation
Kilo     1       1,024                      1,000                       2.40%
Mega     2       1,048,576                  1,000,000                   4.86%
Giga     3       1,073,741,824              1,000,000,000               7.37%
Tera     4       1,099,511,627,776          1,000,000,000,000           9.95%
Peta     5       1,125,899,906,842,624      1,000,000,000,000,000      12.59%
</PRE><p>The net effect is that a Terabyte hard drive is nearly 10% smaller than you would expect it to be!</p><p>As mentioned earlier, not all devices on a computer operate in parallel: networks are mostly serial lines. The phone and digital lines that connect our homes to the Internet communicate in serial. The venerable computer mouse is a serial device. These days the USB protocol is used for just about anything, and the "S" in USB in fact stands for "Serial".</p><P>Because hard drives in actual fact store data in serial (even "parallel" drives like ATA and SCSI drives eventually convert the data to a serial stream of ones and zeros), they follow the SI prefix specification for the number of Bytes in a Gigabyte, while memory modules, which must maximize the investment in bus width and capacity, incorrectly follow a binary interpretation of the decimal prefixes!</p><p>The SI system only recognizes the powers-of-ten meaning of the prefixes. A <A HREF="http://physics.nist.gov/cuu/Units/binary.html">new set of binary prefixes</A> has been defined, though it is not part of the SI standard!</p><PRE>Kilo   1,000 = KB               1,024 = KiB (kibibyte)
Mega   1,000,000 = MB           1,048,576 = MiB (mebibyte)
Giga   1,000,000,000 = GB       1,073,741,824 = GiB (gibibyte)
</PRE><p>Hard drives, modems, network cards, and aeroplanes are designed by engineers following the SI standard, and their size specifications conform to the traditional SI meanings. Memory modules follow the size specifications of the binary prefix system, but marketing brands these with SI decimal prefixes. We as consumers are therefore spoiled, since we get more than what we pay for with RAM!</p><p>There are however two other items worth mentioning.</p><p>The first is solid state storage devices, such as flash drives. Like RAM these are based on a natively parallel medium, and bits need to be counted and maximized for optimal capacity and effectiveness. Yet these are marketed the same way traditional hard drives are - with the SI meaning of GB or Gigabyte. You would think that (ignoring file system overheads) you should be able to store a GB of data from RAM onto a 1-GB solid state drive! Blame this one on marketing and exploitation of the people who have come to expect a "1 GB" hard drive to be less than 1 GB.</p><p>The second is the size of files stored on a hard drive. These are commonly shown with KB having the binary system meaning instead of the SI meaning, despite common storage media being natively formatted as serially accessible streams - hard drives and tapes. I assume this may be in part because the writers of the early general purpose operating systems were so deeply ingrained in thinking about a kilobyte as 1024 bytes that they never considered doing it the other way, and possibly because those files had to be loaded into core memory, which is allocated in chunks whose sizes are powers of two.</p></DIV><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=3><I><br />
So there you have it - don't blame the marketing guys for the missing space on your hard drive, thank them for the extra space on your memory modules. Blame the engineers though. :-)<br />
</I></FONT></P>Anonymoushttp://www.blogger.com/profile/03492411141762009309noreply@blogger.com0tag:blogger.com,1999:blog-3643404517377030815.post-35815205846754176172012-11-14T13:57:00.000+02:002012-11-14T14:00:07.519+02:00Finding space for Solaris Live Upgrade<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=3><I>Something that is often perceived to be an obstacle to using Solaris Live Upgrade is finding space to give to Live Upgrade. There are fortunately quite a few options to help out.<br />
</P><P>Oracle of course recommends that you use spare disks or buy more disks. That is all well and fine for big corporates with deep pockets... assuming that you have slots available to plug in more disks.<br />
</I></FONT></P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">So … on to the more attainable options.<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">To start off it helps to know how much space you will need to perform the actual upgrade. Solaris itself needs about 7GB for a Full plus OEM installation, excluding logs, crash dumps, spool files, and so on. Use the df command to check how much space is used by the root and other critical file systems (/var, /opt, /usr). While this is a good starting point, you may not need to replicate all of that if it includes 3rd-party software that stays the same.<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Since you are also able to combine and split file systems, as well as explicitly “exclude” portions of the boot environment through the -x or -f options, it is possible to get an estimate of the amount of space needed from Live Upgrade. To find out how much space Live Upgrade will need, run the lucreate command that you plan on using up to the point where it displays the estimated space requirement, and then press Ctrl-C to abort it.<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><B>Option 1.</B> The simplest scenario by far: if you are already running on a ZFS root system, you are in luck. Live Upgrade has good support for ZFS root, at least in recent versions of the tool (since Solaris 10 Update 6). It can take a snapshot of the root file system(s), create a clone automatically, and then simply apply changes, like patches or an upgrade, to the clone.<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">ZFS makes it almost too easy, the command is simply:<br />
</P><TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0><COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP><FONT FACE="Courier New, monospace" SIZE=3><PRE># <B>lucreate -c Sol10u5 -n Sol10u8</B>
</PRE></FONT></P></TD></TR></TABLE><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">The command automatically locates the active boot environment (the PBE) and utilises free space from the same pool (it checks the active BE to determine which ZFS pool is to be used). The -c option explicitly names the active BE (Sol10u5 in this example), and -n assigns a name to the new alternate BE, e.g. Sol10u8. (Let's just assume I'm going from Update 5 to Update 8.)<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">It is a good idea to name your BEs after the version of the operating system they contain, especially on ZFS, where it is (almost too) easy to end up with many BEs, e.g. for testing and backup purposes.<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">When using ZFS clones, the space required to create a new BE is less than with other file systems. This is because the contents of the “clone” point at the same blocks on disk as the original source data. Once a block is written to (from either the clone or the origin), the copy-on-write machinery of ZFS takes care of the space allocation. Data that doesn't change is never duplicated!<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">You can therefore safely use the traditional methods for estimating your disk space requirements and rest assured that you will in practice need less than that.<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><B>Option 2:</B> Another ZFS pool, other than the one you boot from, may have free space, or you may want to move to another pool on separate disks for some other reason. When you explicitly specify a ZFS pool different from the source pool, Live Upgrade will copy the contents instead of cloning. Assuming a target ZFS pool named “NEWPOOL”, the command would be:<br />
</P><TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0><COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP><FONT FACE="Courier New, monospace" SIZE=3><PRE># <B>lucreate -c Sol10u5 -n Sol10u8 -p NEWPOOL</B>
</PRE></FONT></P></TD></TR></TABLE><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">As before, the active BE is probed to determine the source for Live Upgrade.<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Note that as a habit I use upper-case names for my ZFS pools. That is because I like them so much. It is also because it makes them stand out as pool names in command output, particularly that of df and mount!<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Not really a separate option as such, but worth mentioning here: with ZFS boot being new, people often want to migrate from UFS to a ZFS root. The commands are the same as when migrating from one ZFS pool to another: once again the source is automatically based on the active BE, and only the destination is specified.<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">You must be running (or be about to upgrade to) at least Solaris 10 10/08 (S10 Update 6) in order to utilise ZFS root. If you are running a Solaris release earlier than Update 2 it will not be possible to use ZFS at all, since the kernel must also support ZFS, not only the Live Upgrade tools.<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">In the below example I create the new ZFS root pool using the drive c0t0d1:<br />
</P><TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0><COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP><FONT FACE="Courier New, monospace" SIZE=3><PRE># <B>zpool create RPOOL c0t0d1s1</B>
# <B>lucreate -c Sol10u5 -n Sol10u8 -p RPOOL</B>
</PRE></FONT></P></TD></TR></TABLE><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">The lucreate command will copy the active BE into the new ZFS pool. Note: You don't have to actually upgrade. Once the copy (create process) completes, run luactivate and reboot to switch over to ZFS.<br />
</P><TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0><COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP><FONT FACE="Courier New, monospace" SIZE=3><PRE># <B>luactivate Sol10u8</B>
# <B>init 6</B>
</PRE></FONT></P></TD></TR></TABLE><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">After checking the system, clean up …<br />
</P><TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0><COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP><FONT FACE="Courier New, monospace" SIZE=3><PRE># <B>ludelete Sol10u5</B>
# <B>prtvtoc /dev/rdsk/c0t0d1s2 | fmthard -s - /dev/rdsk/c0t0d0s2</B>
# <B>zpool attach RPOOL c0t0d1s1 c0t0d0s1</B>
</PRE></FONT></P></TD></TR></TABLE><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">I want to highlight that I specified partition (slice) numbers above. Generally the recommendation is to give ZFS access to the “whole disk”, but for booting it is a requirement to specify a slice.<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">A few extra considerations: the second disk is not automatically made bootable (the boot blocks still have to be installed on it), but rather than repeat the documentation I will just <A HREF="http://docs.oracle.com/cd/E19253-01/819-5461/ggtia/index.html">link this excellent information</A><br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Now that you are on ZFS root you should also configure swap and dump “the ZFS way” - <A HREF="http://docs.oracle.com/cd/E19253-01/819-5461/ggrln/index.html">see here</A><br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">If you choose, for whatever reason, not to move to ZFS root yet (perhaps you are not yet running Solaris 10 Update 6 or later, which ZFS booting requires), you still have some options.<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><B>Option 3:</B> Check whether you have free, unallocated space on any disks. The “prtvtoc” command will show areas on disk that are not part of any partition, as in the below example:<br />
</P><TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0><COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP><FONT FACE="Courier New, monospace" SIZE=3><PRE># <B>prtvtoc /dev/dsk/c1t1d0s2</B>
</PRE></FONT></P></TD></TR></TABLE><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">If any space is not allocated to a partition, there will be a section in the output before the partition table like this<br />
</P><TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0><COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP><FONT FACE="Courier New, monospace" SIZE=3><PRE>* Unallocated space:
*           First       Sector        Last
*          Sector        Count      Sector
*         2097600   4293395776      526079
*         8835840   4288229056     2097599
</PRE></FONT></P></TD></TR></TABLE><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">If so, create a slice to overlay those specific cylinders (I do this carefully by hand on a case-by-case basis), and then use the newly created slice.<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Note: what is commonly called a partition is called a disk slice by the Solaris disk management tools. On x86 there is a separate concept that is called a partition, namely the BIOS-level partitioning of the disk. In that scheme, all the Solaris disk slices live inside the partition tagged as Solaris.<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><B>Option 4:</B> If you do not have unallocated space on any disks, you might still have unused slices... Be careful though – unmounted is not the same as unused! Check with your DBAs whether they are using any raw partitions, ie partitions without a mounted file system. I've also seen cases where people unmount their backup file systems as a “security” measure, though the value in that is debatable.<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">It may be worth mentioning that when looking for space, you can use any disk in the system; it does not have to be one of the first two disks, or even an internal disk.<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">To specify a slice to use as the target, you use the -m option of lucreate, e.g.<br />
</P><TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0><COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP><FONT FACE="Courier New, monospace" SIZE=3><PRE># <B>lucreate -c Sol10u5 -n Sol10u8 -m /:/dev/dsk/c0t0d1s6:ufs</B>
</PRE></FONT></P></TD></TR></TABLE><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">The above command will use /dev/dsk/c0t0d1s6 as the target for the root file system on the new BE. <br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">You can also use SVM meta-devices. For example<br />
</P><TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0><COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP><FONT FACE="Courier New, monospace" SIZE=3><PRE># <B>metainit d106 c0t0d1s6</B>
# <B>lucreate -c Sol10u5 -n Sol10u8 -m /:/dev/md/dsk/d106:ufs</B>
</PRE></FONT></P></TD></TR></TABLE><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Or on a mirror (assuming two free slices)<br />
</P><TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0><COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP><FONT FACE="Courier New, monospace" SIZE=3><PRE># <B>metainit -f d11 1 1 c0t0d0s0</B>
# <B>metainit -f d12 1 1 c0t1d0s0</B>
# <B>metainit d10 -m d11 d12</B>
# <B>lucreate -c Sol10u5 -n Sol10u8 -m /:/dev/md/dsk/d10:ufs</B>
</PRE></FONT></P></TD></TR></TABLE><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Note that the traditional “metaroot” step is left to Live Upgrade to handle, and that the mirror volume in the example is created without syncing because both slices are blank! To be safe you could instead attach the second sub-mirror in the traditional way afterwards.<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><B>Option 5:</B> Split /var and the root file systems. If you have two slices somewhere but neither is large enough to hold the entire system, this could work. Then after completing the upgrade, you can delete the old BE to free up the original boot disk, and “migrate” back to that. It involves a bit of work, but you would use Live upgrade for this migration, which is exactly the kind of thing that makes Solaris so beautiful.<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">The commands to split out /var from root would look like this.<br />
</P><TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0><COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP><FONT FACE="Courier New, monospace" SIZE=3><PRE># <B>lucreate -c Sol10u5 -n Sol10u8 -m /:/dev/md/dsk/d10:ufs -m /var:/dev/md/dsk/d11:ufs</B>
</PRE></FONT></P></TD></TR></TABLE><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">When you compare this with the previous example you will notice the extra -m option for /var. Each mount point specified with -m becomes a separate file system in the target BE. Adding an extra entry for /usr or any other file system works in the same way. To better understand the -m options, think of them as instructions to Live Upgrade about how to build the vfstab file for the new BE.<br />
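</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">To illustrate (a hypothetical sketch, not captured from a real system), the vfstab that Live Upgrade would build for the new BE from the two -m options above would contain entries along these lines:<br />
</P>

```
/dev/md/dsk/d10    /dev/md/rdsk/d10    /       ufs    1    no    -
/dev/md/dsk/d11    /dev/md/rdsk/d11    /var    ufs    1    no    -
```

<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">One line per -m option: device to mount, raw device to fsck, mount point, file system type, fsck pass, mount-at-boot flag and mount options.<br />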
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Note that non-critical file systems, i.e. anything other than root, /var, /usr and /opt, are automatically kept separate and considered shared between BEs.<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><B>Option 6:</B> Temporarily repurpose a swap partition or slice as a root file system. This works if you have “spare” swap space. Don't scoff - I have many a time seen systems with swap space configured purely for the purpose of saving crash dumps. The commands would be<br />
</P><TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0><COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP><FONT FACE="Courier New, monospace" SIZE=3><PRE># <B>swap -d /dev/dsk/c0t0d0s0</B>
# <B>lucreate -c Sol10u5 -n Sol10u8 -m /:/dev/dsk/c0t0d0s0:ufs</B>
</PRE></FONT></P></TD></TR></TABLE><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">There would be some clean-up work left once everything is done, for example deleting the old BE and creating a new swap partition from that space.<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><B>Option 7:</B> Another option is to break an existing SVM mirror. In this case it will not be necessary to copy the file system over to the target, because thanks to the mirror the data is already there. The meta-device for the sub-mirror also already exists. We will however create a new one-sided SVM mirror volume from this sub-mirror as part of the process.<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">To do this you specify two things: A new “mirror” volume, as well as the sub-mirror to be detached and then attached to the new mirror volume.<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Assuming we have d10 with sub-mirrors d11 and d12, we will create a new mirror volume called d100. We will remove d12 from d10, and attach it to d100. A single lucreate command takes care of all of that:<br />
</P><TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0><COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP><FONT FACE="Courier New, monospace" SIZE=3><PRE># <B>lucreate -c Sol10u5 -n Sol10u8 -m /:/dev/md/dsk/d100:mirror,ufs -m /:/dev/md/dsk/d12:detach,attach,preserve</B>
</PRE></FONT></P></TD></TR></TABLE><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">To examine the above command: you can see that -m is specified twice, both times for root. The first has the tags “mirror,ufs” and creates the new mirror volume. The second has the tags “detach,attach,preserve”. Detach: Live Upgrade needs to detach the sub-mirror first. Attach: do not use it directly, instead attach it to the new volume. Preserve: there is no need to reformat the slice and copy the file system contents.<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Instead of breaking a mirror and re-using the sub-mirrors, lucreate can also set up the SVM meta-devices itself, for example:<br />
</P><TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0><COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP><FONT FACE="Courier New, monospace" SIZE=3><PRE># <B>lucreate -c Sol10u5 -n Sol10u8 -m /:/dev/md/dsk/d100:mirror,ufs -m /:/dev/dsk/c1t0d0s3,d12:attach</B>
</PRE></FONT></P></TD></TR></TABLE><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Comparing to the previous example, you will notice that the device specifier in the second -m option lists a physical disk slice as well as a name for a new meta-device. You will also notice that the only tag is “attach”: the new device doesn't need to be detached from anything, and can't be “preserved” since it doesn't hold any data yet.<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><B>Option 8:</B> If you have your root file systems mirrored with Veritas Volume manager, and there is no other free space large enough to hold a root file system, then I suggest that you manually break the mirror to free up a disk, rather than try to use the vxlu* scripts.<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">I have not personally had access to a VxVM-based system in years, but judging from the rough time many people apparently have, based on the questions I see in forums, I would recommend that you un-encapsulate, perform the upgrade, and then finally re-encapsulate.<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><B>Option 9:</B> If you have some rarely accessed data on disk you may have the option of temporarily moving that onto another system or even a tape in order to free up space. After completing the upgrade you can restore this data.<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><B>Option 10:</B> Move third-party applications to other disks or file systems, and remove crash dumps, stale log files, old copies of Recommended patch clusters, and the like. This actually isn't a separate recommendation - it is something you should be doing in any case, in addition to whichever other options you use. It should really be "Option # 0".<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">These files, if they reside in the critical file systems, will be copied unless you expressly exclude them. <br />
With ZFS root and snapshots this is less of an issue - the snapshot doesn't duplicate data until a change is written to a file. This could however create the reverse problem: an update to “shared” files that live in a critical file system will not be replicated back to the original BE, because data in a cloned file system is treated as not shared!<br />
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">You probably cannot exclude your application or software directories, so instead do this: first move the application directory to any shared file system, then create a soft-link from the location where the directory used to be to where you moved it. I have yet to encounter an application that will not allow itself to be relocated in this way, and can confirm that it works fine for SAP, Oracle applications, Siebel and SAS, as well as many other “infrastructure” software components, like Connect Direct, Control-M, TNG, Netbackup, etc.<br />
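</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">The relocation itself is just a move plus a soft-link. Below is a minimal sketch using a scratch directory to stand in for the real file systems; on a live system the paths would be something like /opt/app and a shared file system such as /export/shared (both names are assumptions for illustration):<br />
</P>

```shell
# Minimal sketch of the move-and-symlink relocation. A scratch directory
# stands in for the real file systems; on a live system the paths would
# be e.g. /opt/app (in a critical FS) and /export/shared (a shared FS).
base=$(mktemp -d)
mkdir -p "$base/opt/app" "$base/export/shared"
echo "app data" > "$base/opt/app/config"

mv "$base/opt/app" "$base/export/shared/app"      # move onto the shared file system
ln -s "$base/export/shared/app" "$base/opt/app"   # soft-link from the old location

cat "$base/opt/app/config"                        # prints "app data" - still reachable
```

<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Because the moved directory now lives in a shared file system, both boot environments see the same copy through the unchanged path.<br />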
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">A few more notes:<br />
<ol><li>Swap devices are shared between boot environments if you do not specifically handle them. In most cases this default behaviour should be sufficient.<br />
</li><li>If you have /var as a separate file system, it will be “merged” into the root file system unless expressly specified with a separate -m option. This is true for all the critical file systems: root, /var, /usr and /opt.<br />
</li><li>On the other hand, all shareable, non-critical file systems are handled as shared by default. This means they will not be copied, merged, or have any changes done to them, and will be used as is and in place.<br />
</li><li>To merge a non-critical file system into its parent, use the special “merged” device as the target in the -m option. For example, the following will merge /home into the root:<br />
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0><COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP><FONT FACE="Courier New, monospace" SIZE=3><PRE># <B>lucreate -c Sol10u5 -n Sol10u8 -m /:/dev/md/dsk/d100:ufs -m /home:merged:ufs</B>
</PRE></FONT></P></TD></TR></TABLE></li></ol><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">In this article I have not really spoken about the actual upgrade that happens after the new BE is created. I've posted about it in the past and in most cases it is already well documented on the web.<br />
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=3><I><br />
Another very interesting subject is what happens during luactivate! I'll leave that for another day! It is a real pity that Oracle is deprecating Live Upgrade, but it will still be around for a while.<br />
</I></FONT></P>Anonymoushttp://www.blogger.com/profile/03492411141762009309noreply@blogger.com0tag:blogger.com,1999:blog-3643404517377030815.post-42088248289606993112012-03-03T11:23:00.002+02:002012-03-03T11:35:10.720+02:00Sun said it firstTime and again, it takes the x86-based market years to figure out truths that Sun Microsystems had been stating for years.
The latest example is <a href="http://www.anandtech.com/show/5624/amd-executes-on-promise-of-agility-intends-to-acquire-seamicro-for-334m">here</a>, I quote from the second paragraph:
<blockquote>There's a category of server applications that can be better served by a lower class of good enough computing, delivering much better power efficiency. Content web servers, similar to what we use at AnandTech, don't present a hugely complex workload but they do see lots of threads and have largely variable performance requirements. SeaMicro's technology reduces power consumption by using lower power CPUs and highly power optimized motherboards.</blockquote>
Doesn't that sound just a little too similar to what Sun has been saying forever?
I wonder whether anybody else anywhere did as much innovation as Sun did. What a failure to market: almost every cellphone ships with Java, yet almost nobody knows who created it. That, and Sun's habit of giving everything away for free, is why it exists no more. Good products alone don't make a company successful - the company needs to be able to turn those products into a profit.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3643404517377030815.post-39258233090720811552011-03-30T12:23:00.003+02:002011-03-30T12:33:53.749+02:00Maintaining the Linux device driver code base<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=2><I>
After a (sadly) failed attempt to convert my significant other to Linux, I had a discussion with her about why it failed. Root cause.
</FONT></I></P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Her computer works well with Windows, not at all with Linux. The reason is that her laptop will display no better than 800x600 resolution as there is no good SIS671 graphics driver for Linux (and there is for Windows). Nothing recent, functional, supported, viable or workable.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Why isn't there one for Linux?
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Because it doesn't make money. Business is the process of converting time into money. Sales people get customers to buy a product or service. Technical people produce the products or deliver the services. Management and administration functions support and enable the business to operate as a whole. (Or so the theory goes, but that is another story)
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
And because programmers also need a place to live. And to feed the kids.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
There is cost involved – an investment, and there is a price, the return on the investment. A product, in this case a device driver for the graphics processor, needs to be designed, produced and supported. The technical people and the tools they need to do this do not come cheap.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Device drivers for Linux, however, do not make much, if any, money for the companies involved. People do not pay for device drivers; rather, they (rightly) expect them to be included in the cost of the hardware.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Even closed-source Linux drivers are free - the vendor has to cover the costs through the sales of hardware. But the business model is flawed: the cost to deliver the Linux device driver far exceeds the income generated from hardware sales to Linux users. Thus this expense must be subsidised from sales to Windows users.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Unless the Linux user base grows to reach a critical mass, the point where enough Linux users buy the hardware to be able to justify the cost of the driver development and support, the situation will not change.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
The above situation is the same - no, actually worse - for other hardware: webcams, GPSes, cell phones, USB thumb drives, Bluetooth hardware, Wi-Fi and network cards, fingerprint readers and touch-pad input devices. Every single bit of hardware.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Because of this situation the Linux kernel includes almost all of its device drivers in the kernel tree itself. It is the only way the Linux community can use most of the consumer hardware available in the world today - that is, by developing the needed device drivers themselves.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
As a result Linux supports much, much more hardware out of the box than Windows does. Windows depends on the driver disks that ship with the hardware, because Microsoft does not provide driver software for every bit of hardware out there!
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
The more you think about it, the more you realise just how unbalanced the situation really is! Microsoft sells its Windows operating system with only basic device drivers included – for proper functionality, features and performance, you need to load the hardware manufacturer's drivers. The hardware manufacturers provide the device drivers because otherwise they would lose the majority of their market – Windows users.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
The Linux community, an entity that makes no money, needs to provide device drivers created through donated effort. I am aware of the exceptions, but that does not change the overall picture. The effort to maintain and update the base of device drivers included in the Linux kernel increases as the number of pieces of hardware to be supported increases. In other words: Every time a new piece of hardware appears in the shops.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
To add insult to injury, the Linux community locks itself in with the GPL licence, which means it cannot, for example, utilise and share driver development effort with other Unix or BSD distributions, because the Linux kernel enforces the use of the restrictive GPL.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Even worse, a Linux device driver works only on a specific release of the kernel. This is because the kernel interfaces for device drivers change, and as a result a device driver needs to be re-compiled for every update, even minor updates, to the kernel. The amount of extra work this places on hardware manufacturers to ensure that their device driver works on every kernel version is significant, and much more than what is needed for, for example, Windows or Solaris.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
The long and the short of it is that producing and maintaining device drivers for Linux is prohibitively expensive, while the market lost by not supporting Linux users is essentially negligible to most hardware manufacturers' bottom line!
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Regarding the market share situation: I have long held the belief that through “allowing” us to copy Windows, Bill Gates got the world using MS Windows. It is what most of us grew up with on our computers at home, and what we as a result expected when we entered the workplace. And it is more than just the current work force: today's computer gamer is tomorrow's IT business decision maker.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
But there is some light on the horizon: The Wayland Display Server may just give the Linux graphics stack the performance boost it needs to make it a viable gaming platform, which in turn will gain it the adoption of many gamers, and in the long run more market share on the desktop. Now if only Linus would fix the device driver ABIs and APIs to make it that bit easier for hardware manufacturers to support their device driver software on Linux...
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=2><I>
There is a lot of FUD on the net about how the <a href="http://kr.linuxfoundation.org/collaborate/publications/linux-driver-model">"deliberately dynamic ABIs"</a> of the Linux kernel make Linux drivers better maintained, less buggy, etc. Sigh.
</I></FONT></P>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3643404517377030815.post-35450683425261452282011-02-17T15:49:00.005+02:002011-02-21T14:22:24.503+02:00Live Upgrade to install the recommended patch cluster on a ZFS snapshot<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=3><I>
Live Upgrade used to require that you find some free slices (partitions) and then fiddle with the -R "alternate root" options to install the patch cluster to an ABE. With ZFS all of those pains have just ... gone away ...
</I></FONT></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Nowadays Live Upgrade on ZFS doesn't even copy the installation; instead it <B>automatically</B> clones a snapshot of the boot environment, saving much time and disk space! Even the patch install script is geared towards patching an alternate boot environment!
</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
The patching process involves six steps:</P>
<ol><li>Apply Pre-requisite patches</li><li>Create an Alternate Boot Environment</li><li>Apply the patch cluster to this ABE</li><li>Activate the ABE</li><li>Reboot</li><li>Cleanup</li></ol>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Note: The system remains online throughout all except the reboot step.
</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
In preparation, uncompress the downloaded patch cluster file. I created a ZFS file system, mounted it on /patches, and extracted the cluster in there. Furthermore, you have to read the cluster README file - it contains a "password" needed to install, and information about pre-requisites and gotchas. Read the file. This is your job!
</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
The pre-requisites are essentially just patches to the patching tools themselves, conveniently included in the patch cluster!
</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
<B>Step 1 - Install the pre-requisites for applying the cluster to the ABE</B>
</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<FONT FACE="Courier New, monospace" SIZE=3><PRE>
# <B>cd /patches/10_x86_Recommended</B>
# <B>./installcluster --apply-prereq</B>
</PRE></FONT></P></TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Note - If you get an Error due to insufficient space in /var/run, see my previous blog post <A HREF="http://initialprogramload.blogspot.com/2011/02/adding-zfs-zvol-for-extra-swap-space.html">here</A>!
</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
<B>Step 2 - Create an Alternate boot environment (ABE)</B>
</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<FONT FACE="Courier New, monospace" SIZE=3><PRE>
# <B>lucreate -c s10u9 -n s10u9patched -p rpool</B>
<CODE>
Checking GRUB menu...
Analyzing system configuration.
No name for current boot environment.
Current boot environment is named <s10u9>.
Creating initial configuration for primary boot environment <s10u9>.
The device </dev/dsk/c1t0d0s0> is not a root device for any boot environment; cannot get BE ID.
PBE configuration successful: PBE name <s10u9> PBE Boot Device </dev/dsk/c1t0d0s0>.
Comparing source boot environment <s10u9> file systems with the file
system(s) you specified for the new boot environment. Determining which
file systems should be in the new boot environment.
Updating boot environment description database on all BEs.
Updating system configuration files.
Creating configuration for boot environment <s10u9patched>.
Source boot environment is <s10u9>.
Creating boot environment <s10u9patched>.
Cloning file systems from boot environment <s10u9> to create boot environment <s10u9patched>.
<B>Creating snapshot</B> for <rpool/ROOT/s10_0910> on <rpool/ROOT/s10_0910@s10u9patched>.
<B>Creating clone</B> for <rpool/ROOT/s10_0910@s10u9patched> on <rpool/ROOT/s10u9patched>.
Setting canmount=noauto for </> in zone <global> on <rpool/ROOT/s10u9patched>.
Saving existing file </boot/grub/menu.lst> in top level dataset for BE <s10u9patched> as <mount-point>//boot/grub/menu.lst.prev.
File </boot/grub/menu.lst> propagation successful
Copied GRUB menu from PBE to ABE
No entry for BE <s10u9patched> in GRUB menu
Population of boot environment <s10u9patched> successful.
Creation of boot environment <s10u9patched> successful.
</CODE></PRE></FONT></P>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
There is now an extra boot environment to which we can apply the Patch Cluster. Note - for what it is worth, if you just needed a test environment to play in, you can now <B>luactivate</B> the alternate boot environment and then make any changes to the newly active system. If the system breaks, all it takes to undo any and all changes is to activate the original BE again and reboot.</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
<B>Step 3 - Apply the patch cluster to the BE named s10u9patched.</B>
</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<FONT FACE="Courier New, monospace" SIZE=3><PRE>
# <B>cd /patches/10_x86_Recommended</B>
# <B>./installcluster -B s10u9patched</B>
</PRE></FONT>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
I am not showing the long and boring output from the installcluster script, as this blog post is already far too long. The patching runs for quite a while; plan for at least two hours. Monitor the process and check the log for warnings. Depending on how long it has been since patches were last applied, some severe patches may be applied which can affect your ability to log in after rebooting. Again: READ the README!
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
<B>Step 4 - Activate the ABE.</B>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<FONT FACE="Courier New, monospace" SIZE=3><PRE>
# <B>luactivate s10u9patched</B>
System has findroot enabled GRUB
Generating boot-sign, partition and slice information for PBE <s10u9>
A Live Upgrade Sync operation will be performed on startup of boot environment <s10u9patched>.
Generating boot-sign for ABE <s10u9patched>
Generating partition and slice information for ABE <s10u9patched>
Copied boot menu from top level dataset.
Generating multiboot menu entries for PBE.
Generating multiboot menu entries for ABE.
Disabling splashimage
Re-enabling splashimage
No more bootadm entries. Deletion of bootadm entries is complete.
GRUB menu default setting is unaffected
Done eliding bootadm entries.
**********************************************************************
The target boot environment has been activated. It will be used when you
reboot. NOTE: You MUST NOT USE the reboot, halt, or uadmin commands. You
<B>MUST USE either the init or the shutdown command when you reboot</B>. If you
do not use either init or shutdown, the system will not boot using the
target BE.
**********************************************************************
In case of a failure while booting to the target BE, the following process
needs to be followed to fallback to the currently working boot environment:
1. Boot from the Solaris failsafe or boot in Single User mode from Solaris
Install CD or Network.
2. Mount the Parent boot environment root slice to some directory (like
/mnt). You can use the following commands in sequence to mount the BE:
zpool import rpool
zfs inherit -r mountpoint rpool/ROOT/s10_0910
zfs set mountpoint=<mountpointName> rpool/ROOT/s10_0910
zfs mount rpool/ROOT/s10_0910
3. Run <luactivate> utility with out any arguments from the Parent boot
environment root slice, as shown below:
<mountpointName>/sbin/luactivate
4. luactivate, activates the previous working boot environment and
indicates the result.
5. Exit Single User mode and reboot the machine.
**********************************************************************
Modifying boot archive service
Propagating findroot GRUB for menu conversion.
File </etc/lu/installgrub.findroot> propagation successful
File </etc/lu/stage1.findroot> propagation successful
File </etc/lu/stage2.findroot> propagation successful
File </etc/lu/GRUB_capability> propagation successful
Deleting stale GRUB loader from all BEs.
File </etc/lu/installgrub.latest> deletion successful
File </etc/lu/stage1.latest> deletion successful
File </etc/lu/stage2.latest> deletion successful
Activation of boot environment <s10u9patched> successful.
</PRE></FONT></TD></TR></TABLE>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<FONT FACE="Courier New, monospace" SIZE=3><PRE>
# <B>lustatus</B>
Boot Environment Is Active Active Can Copy
Name Complete Now On Reboot Delete Status
-------------------------- -------- ------ --------- ------ ----------
s10u9 yes no no yes -
s10u9patched yes yes yes no -
</PRE></FONT></TD></TR></TABLE>
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Carefully take note of the details on how to recover from a failure; making a hard copy of them is not a bad idea! Also note that you have to use either <B>init</B> or <B>shutdown</B> to effect the reboot, as the other commands bypass some of the delayed-action scripts! Hence ...
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
<B>Step 5 - Reboot using shutdown or init ...</B>
</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<FONT FACE="Courier New, monospace" SIZE=3><PRE>
# <B>init 6</B>
</PRE></FONT>
</TD></TR></TABLE>
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Monitor the boot-up sequence. A few handy commands while you are performing the upgrade include:
</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<FONT FACE="Courier New, monospace" SIZE=3><PRE>
# <B>lustatus</B>
# <B>bootadm list-menu</B>
# <B>zfs list -t all</B>
</PRE></FONT>
</TD></TR></TABLE>
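For what it is worth, the lustatus output also lends itself to scripting. A minimal sketch that extracts the BE which will be active at the next reboot; the sample lines below are illustrative stand-ins in the column layout shown earlier, since the real command only exists on Solaris:

```shell
# Extract the boot environment that will be active on reboot from
# lustatus-style output (4th column is "Active On Reboot").
# Sample data, not captured from a live system:
printf '%s\n' \
  's10u9                       yes      no     no        yes    -' \
  's10u9patched                yes      yes    yes       no     -' |
awk '$4 == "yes" { print $1 }'
# → s10u9patched
```

On a real system you would pipe lustatus itself through the awk filter, skipping the header lines.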
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
You will eventually (after confirming that everything works as expected) want to free up the disk space held by the snapshots. The first command below cleans up the redundant Live Upgrade entries as well as the associated ZFS snapshot storage; the second removes the temporary ZFS file system used for the patching.
</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
<B>Step 6 - Cleanup</B>
</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<FONT FACE="Courier New, monospace" SIZE=3><PRE>
# <B>ludelete s10u9</B>
# <B>zfs destroy rpool/patches</B>
</PRE></FONT>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=3><I>
Again, no worries about where the space comes from: ZFS simply manages it! Live Upgrade takes care of your GRUB boot menu and gives you clear instructions on how to recover if anything goes wrong.
</I></FONT></P>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3643404517377030815.post-40039750151047214312011-02-17T11:27:00.005+02:002011-02-17T13:28:53.765+02:00Adding a ZFS zvol for extra swap space<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=2><I>
ZFS sometimes truly takes the thinking out of allocating and managing space on your file systems. But only sometimes.
</I></FONT></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Many operations on Solaris, OpenSolaris and OpenIndiana can run you into swap space issues. For example, using the new Solaris 10 VirtualBox appliance, you will get the following message when you try to install the Recommended Patch Cluster:
</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace" SIZE=2><PRE>
Insufficient space available in /var/run to complete installation of this patch
set. On supported configurations, /var/run is a tmpfs filesystem resident in
swap. Additional free swap is required to proceed applying further patches. To
increase the available free swap, either add new storage resources to swap
pool, or reboot the system. This script may then be rerun to continue
installation of the patch set.
</PRE></FONT></P>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
This is fixed easily enough by adding more swap space, like this:
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<FONT FACE="Courier New, monospace" SIZE=2><PRE>
# <B>zfs create -V 1GB -b $(pagesize) rpool/swap2</B>
# <B>zfs set refreservation=1GB rpool/swap2</B>
# <B>swap -a /dev/zvol/dsk/rpool/swap2</B>
# <B>swap -l</B>
swapfile dev swaplo blocks free
/dev/zvol/dsk/rpool/swap 181,2 8 1048568 1048568
/dev/zvol/dsk/rpool/swap2 181,1 8 2097144 2097144
</PRE></FONT>
</TD></TR></TABLE>
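Rather than hard-coding the block size, the -b value can be derived from the running system. A small sketch, assuming only the POSIX getconf utility (the portable counterpart of the Solaris pagesize command); it merely prints the zfs command rather than running it, since zfs is not available everywhere:

```shell
# Derive the recommended volblocksize from the host page size
# (4096 on x86, 8192 on SPARC) and print the matching zfs command.
PAGESZ=$(getconf PAGESIZE)
echo "zfs create -V 1G -b ${PAGESZ} rpool/swap2"
```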
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Setting the reservation is important, particularly if you plan on making the change permanent, e.g. by adding the new zvol as a swap entry in /etc/vfstab. ZFS does not otherwise reserve the space for swapping, so without it the swap system may believe space is available that actually isn't.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
The -b option sets the volblocksize to improve swap performance by aligning the volume's I/O units on disk with the host architecture's memory page size (4 KB on x86 systems and 8 KB on SPARC, as reported by the pagesize command).
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
If this is just temporary, then cleaning up afterwards is just as easy:
</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<FONT FACE="Courier New, monospace" SIZE=2><PRE>
# <B>swap -d /dev/zvol/dsk/rpool/swap2</B>
# <B>zfs destroy rpool/swap2</B>
</PRE></FONT>
</TD></TR></TABLE>
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
It is also possible to grow the existing swap volume. To do so, set a new size and refreservation for the existing volume like this:
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<FONT FACE="Courier New, monospace" SIZE=2><PRE>
# <B>swap -d /dev/zvol/dsk/rpool/swap</B>
# <B>zfs set volsize=2g rpool/swap</B>
# <B>zfs set refreservation=2g rpool/swap</B>
# <B>swap -a /dev/zvol/dsk/rpool/swap</B>
</PRE></FONT>
</TD></TR></TABLE>
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
And finally, it is possible to do the above without unmounting/remounting the swap device, by using the following "trick":
</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<FONT FACE="Courier New, monospace" SIZE=2><PRE>
# <B>zfs set volsize=2g rpool/swap</B>
# <B>zfs set refreservation=2g rpool/swap</B>
# <B>swap -l | awk '/rpool.swap/ {print $3+$4}'|read OFFSET</B>
# <B>env NOINUSE_CHECK=1 swap -a /dev/zvol/dsk/rpool/swap $OFFSET</B>
</PRE></FONT>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
The above calculates the offset just past the end of the in-use swap device and adds a new "device" to the list of swap devices, which automatically uses the added space in the zvol. The offset will be shown as the "swaplo" value in swap -l output. Multiple swap devices on the same physical media are not ideal, but on the next reboot (or after deleting and re-adding the swap device) the system will recognise the full size of the volume.
</P>
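The offset arithmetic is easy to verify offline. A sketch applying the awk expression from the trick above to sample swap -l output (the numbers mirror the earlier listing; swaplo and blocks are in 512-byte units):

```shell
# Compute the offset just past the end of the existing swap slice:
# swaplo ($3) + blocks ($4), using sample `swap -l` output.
printf '%s\n' \
  'swapfile                  dev  swaplo  blocks    free' \
  '/dev/zvol/dsk/rpool/swap  181,2      8 1048568 1048568' |
awk '/rpool.swap/ { print $3 + $4 }'
# → 1048576
```

The header line does not match the /rpool.swap/ pattern, so only the data line contributes; 8 + 1048568 blocks puts the new "device" exactly at the 512 MB boundary where the original swap slice ends.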
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=2><I>
No worries about where the space comes from. ZFS just allocates it! The flip side of the coin is that once you have quotas, reservations, allocations, and indirect allocations such as from snapshots, figuring out where your space has gone can become quite tricky! I'll blog about that some time!
</I></FONT></P>Unknownnoreply@blogger.com2tag:blogger.com,1999:blog-3643404517377030815.post-38795275428148320582010-12-06T17:52:00.003+02:002010-12-06T19:19:50.458+02:00Useless Performance Comparisons<p style="margin-bottom: 0in;" align="JUSTIFY"><span style="font-size:85%;"><i>The point of performance comparisons or benchmark articles has to be purely sensational. By far the most of these appear to have little value other than attracting less educated readers to the relevant websites.</i></span></p>
<p style="margin-bottom: 0in;" align="JUSTIFY">In a recent <a href="http://www.phoronix.com/scan.php?page=article&item=linux_kqzfs_benchmarks&num=1">article</a> Michael Larabel of Phoronix reports on the relative performance of various file systems under Linux, specifically comparing the traditional Linux file systems to the new (not yet quite available) native ZFS module. According to the article ZFS performs slower than the other Linux file systems in most of the tests, but I have a number of issues with both how the testing was done and with how the article was written.</p>
<p style="margin-bottom: 0in;" align="JUSTIFY">Solaris 11 Express should have been included in the test, and the results for OpenIndiana should be shown for all tests. It is also crucial that the report include other system metrics, such as CPU utilization during the test runs.</p>
<p style="margin-bottom: 0in;" align="JUSTIFY">I also have some even more serious gripes. In particular, the blanket statement that some unspecified “subset” of the tests was performed on both a standard SATA hard drive and the SSD, but that the results were “proportionally” the same, does not make sense: some tests are more sensitive to seek latency than others, and some file systems hide these latencies better than others.</p>
<p style="margin-bottom: 0in;" align="JUSTIFY">Another serious gripe is that there is no feature comparison. More complex software has more work to do, and one would expect some trade-offs.</p>
<p style="margin-bottom: 0in;" align="JUSTIFY">Even worse: two of ZFS’s strengths were eliminated by the way the testing was done. Firstly, when ZFS is given a “whole disk” as recommended in the ZFS best practices (as opposed to just a partition), it will safely enable the disk’s write cache. It only does this when it knows there are no other file systems on the disk, i.e. when ZFS is in control of the whole disk. Secondly, ZFS manages many disks very efficiently, particularly as far as allocating space is concerned: ZFS performance doesn't come into its own on a single-disk system!</p>
<p style="margin-bottom: 0in;" align="JUSTIFY">Importantly, and especially so since this is very much a beta version of a port of a mature and stable product, we need to understand which of ZFS's features are present, different and/or missing compared to the mature product. For example, one of ZFS’s biggest performance inhibitors under FUSE is that it is limited to a single-threaded ioctl (Ed: Apparently this is fixed in <a href="http://zfs-fuse.net/releases/0.6.9">ZFS for Linux 0.6.9</a>, but I am unable to tell whether this is the version Phoronix tested), along with not having low-level access to the disk devices. The KQ Infotech website <a href="http://zfs.kqinfotech.com/features.php">lists some missing features</a>; particularly interesting is the missing Linux async I/O support. Furthermore, the KQ Infotech FAQ states that Direct IO falls back to buffered read/write functions and that missing kernel APIs are emulated through the <a href="http://zfs.kqinfotech.com/how_to_google_doc.php">"Solaris Porting Layer"</a>. </p>
<p style="margin-bottom: 0in;" align="JUSTIFY">A quick search highlights some serious known issues, such as the <a href="https://github.com/behlendorf/zfs/issues#issue/9">Linux VFS Cache and ZFS ARC cache copy duplication bug</a>, a bug which heavily impacts on performance.</p>
<p style="margin-bottom: 0in;" align="JUSTIFY">More information about missing features can be found on the <a href="https://github.com/behlendorf/zfs/issues#list">LLNL issue tracker</a> page.</p>
<p style="margin-bottom: 0in;" align="JUSTIFY">If nothing else, the article should mention the fact that there are known severe performance issues and feature incompleteness with the Linux native ZFS module! The way in which Linux allocates and manages virtual address space is inefficient (don't take my word for it, see <a href="http://www.makelinux.net/ldd3/chp-8-sect-4.shtml">this</a> and <a href="http://kerneltrap.org/mailarchive/linux-kernel/2010/4/15/4559338">this</a>), requiring <a href="https://github.com/behlendorf/zfs/issuesearch?state=open&q=vmalloc#issue/75">expensive workarounds</a>.</p>
<p style="margin-bottom: 0in;" align="JUSTIFY">Besides all of this, my real, main gripe is with this kind of article in general. The common practice of testing everything with “default installation settings” implies that nothing else needs to be done. However, when you want the absolute best possible performance out of something, you need to tune it for the specific workload and conditions. In the case of the article in question, the statement reads “All file-systems were tested with their default mount options”, and no other information is given: whether the disk was partitioned, whether the different subject file systems were mounted at the same time, which disk the system was booted from, or whether the operating system was running with part of the disk hosting the tested file system mounted as its root. We don’t even know whether the author read the <a href="http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide">ZFS Best Practices Guide</a>.</p>
<p style="margin-bottom: 0in;" align="JUSTIFY">It can be argued that the average person will not tune the system, or in this case the file system, for one specific workload because their workstation should be an all-round performer, but you should still comply with the best practices recommendations from the vendors, especially if performance is one of your main criteria.</p>
<p style="margin-bottom: 0in;" align="JUSTIFY">I don’t know whether using defaults is ever acceptable in this kind of article. My issue stems from how these articles are written in a way that suggests performance is the only, or at least the most important, factor in choosing an operating system, file system, graphics card, CPU or whatever the subject is. If that were true, then at the very least each system should be tuned to make the most of each subject candidate, whether hardware or software is being tested and compared. This tuning is often done by disabling features and configuring the relevant options, and to get it right you usually need a performance expert on that piece of software or hardware to optimize it for each test.</p>
<p style="margin-bottom: 0in;" align="JUSTIFY">Specific hardware (or software) often favors one or the other of the entrants. An optimized, feature-poor system will outperform a complex, feature-rich system on limited hardware. Making the best use of the available hardware might mean different implementation choices when optimizing for performance rather than for functionality or reliability. ZFS in particular comes into its own, both in terms of features and performance, when it has a lot of hardware underneath it: RAM, disk controllers, and as many dedicated disk drives as possible. The other file systems have likely reached their performance limit on the limited hardware on which the testing was done. Linux is particularly aimed at the non-NUMA, single-core, single-hard-drive, single-user environment. Solaris, and ZFS, were developed in a company where single-user workstations were an almost non-existent target; the real target was of course the large servers of the big corporates.</p>
<p style="margin-bottom: 0in;" align="JUSTIFY">As documented in the <a href="http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide">ZFS Evil Tuning Guide</a>, many tuning options exist. One could turn off ZFS check-sum calculations, limit the ARC cache sizes, set the SSD disk as a cache or log device for the SATA disk, and set the pool to cache only meta data, to mention a few. Looking at the hardware available in the Phoronix article, the choices would depend on the specific test – in another test one might stripe between the SATA disk and the SSD disk, in another you might choose to mirror across the two.</p>
<p style="margin-bottom: 0in;" align="JUSTIFY">The other file system candidates might have different recommendations in terms of how to optimize for performance.</p>
<p style="margin-bottom: 0in;" align="JUSTIFY">I realize that functionality would be affected by such tuning, but the article doesn’t look at functionality, or even usability for that matter. ZFS provides good reliability and data integrity, but only in its default configuration, with data check-summing turned on. The data protection levels and usable space in each test might differ, but that again is a function of which features are used; it is not the subject of the article, nor even mentioned anywhere in it.</p>
<p style="margin-bottom: 0in;" align="JUSTIFY">As a case in point for the argument about functionality, one needs to consider all that ZFS does in addition to being a POSIX-compliant file system. It replaces the volume manager. It adds data integrity checking through check-summing. It automatically manages space allocation, including space for file systems, meta-data, snapshots, and ZVOLs (virtual devices created out of a ZFS pool). Usage can be controlled by means of a complete set of reservation and quota options. Changing settings, such as turning on encryption, the number of copies of data to be kept, or whether to do check-summing, is dynamic. There is much more, as <a href="http://www.google.com/search?q=Advantages+of+ZFS">Google will tell</a>.</p>
<p style="margin-bottom: 0in;" align="JUSTIFY">And just to add insult to injury, the article goes and pits XFS against ZFS while ignoring the many severe reliability issues reported with XFS, such as the often-reported data corruption under heavy load and severe file system corruption when losing power.</p>
<p style="margin-bottom: 0in;" align="JUSTIFY">I would really like to see a proper performance competition one day. The details of how the testing will be done are given out in advance to allow the teams to research it. Each team is then given the same budget from which to buy and build its own system to enter into the competition. Their performance experts set up and build the systems, then install the software and tune it for the tests on the specific hardware they have chosen. One team might buy a system with more CPUs while another might buy one with more disks and SCSI controllers, but the test is fair (barring my observation about how feature-poor systems will always perform better on a low-budget system) because the teams each solve the same problem with the same budget. The teams submit their finished systems to the judges, who run the performance test scripts, and publish their configuration details in a how-to guide. To eliminate cheating, an independent group then tries to duplicate each team’s results using the guide.</p>
<p style="margin-bottom: 0in;" align="JUSTIFY"><span style="font-size:85%;"><i>I think this would make a fun event for a LAN party – any sponsors interested?</i></span></p>Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-3643404517377030815.post-50933770091618637942009-05-10T19:20:00.005+02:002010-05-07T09:10:12.345+02:00Lost Dog!<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj6P6xkHxn6xzhletZKU8FEAAjV78N0qR3wgmv7z6eRlCT3-haX-mPUh5f3-Yb8Kr79KSENlgSynCiIHq2cQEe-vQ5EtVAp4582QXB_KzFi1RGJDJWNfM6L1ZgLwi6yrBg3o8sQZwVqm0Q/s1600-h/LostDog-sml.png"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 282px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj6P6xkHxn6xzhletZKU8FEAAjV78N0qR3wgmv7z6eRlCT3-haX-mPUh5f3-Yb8Kr79KSENlgSynCiIHq2cQEe-vQ5EtVAp4582QXB_KzFi1RGJDJWNfM6L1ZgLwi6yrBg3o8sQZwVqm0Q/s400/LostDog-sml.png" alt="" id="BLOGGER_PHOTO_ID_5334246647238642370" border="0" /></a>
Otto, our Dachshund cross, got lost yesterday. If someone found such a dog in the area near Stellenberg High School and did a Google search, I hope they will hit on this page.
Otto is a little brown dog with really big ears. He is my son Francois' best friend, so it would be really terrible if we never found him again.
For the record, we live in Amanda Glen, but the dog could easily walk into the Sonstraal Heights or Stellenberg area.
If you saw this dog, please call Johan at 021 910 7160 or Reinette at 021 976 3453.
Thank you!Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3643404517377030815.post-26731403537347714542009-04-26T17:02:00.001+02:002009-04-26T17:25:20.182+02:00ZFS user quotas available in SNV build 114<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=2><I>
I noted, as per <a href="http://blogs.sun.com/chrisg/entry/user_and_group_quotas_for">Chris Gerhard's Weblog</a> that user and group Quotas on ZFS will be available soon - the fix to bug ID <a href="http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6501037">6501037</a> is currently slated for inclusion in ON build 114.
</I></FONT></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Once this becomes available I will have one fewer item on <a href="http://initialprogramload.blogspot.com/2008/07/zfs-missing-features.html"> my list of features missing from ZFS.</a>
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Currently, to limit users' consumption, the <a href="http://opensolaris.org/os/community/zfs/faq/#zfsquotas">workaround documented here</a> is to provide each user with a dedicated directory on which a separate dataset is mounted and a quota is set. This implies that the user can only create or write files in that specific directory. Tracking and limiting a user's total usage across an entire ZFS pool requires user quotas; ditto for consumption by group.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
According to <a href="http://prefetch.net/blog/index.php/2009/03/31/zfs-user-and-group-quotas/">this post by Matty</a>, the feature is implemented in a way that enforces the limit "tardily", that is, a little "late". It also mentions that translated SIDs (e.g. when the directory is shared via SMB) are supported.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
The <a href="http://arc.opensolaris.org/caselog/PSARC/2009/204/20090330_matthew.ahrens">PSARC/2009/204 document here</a> provides details of how the quotas are implemented. Two new zfs subcommands, zfs userspace and zfs groupspace, report the consumption, and control is by means of a set of new properties on ZFS file system datasets.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=2><I>
This amounts to good news all around. Maybe I should start tracking bug IDs for all of the items on my feature wish-list!</I></FONT></P>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3643404517377030815.post-8167225188121627032009-04-23T19:30:00.002+02:002009-04-23T21:08:12.676+02:00Oracle becomes the second "IBM"<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=2><I>
As promised, my thoughts on the merger with Oracle.
</I></FONT></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><B>
Disclaimer: We know that the deal has not been finalized yet, and these are my personal opinions and thoughts on the matter.</B>
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
I have read many forum posts about the merger, many of them very negative about the whole deal. Whether it is people fearing that Oracle will kill off this or that product, or people who feel that it is good riddance to Sun, these all indicate a serious misunderstanding of the IT industry as a whole, of what Open Source software is, and most of all of what Sun is about.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
On a personal level the merger scares me: change is always stressful, and we love the culture at Sun. Sun's culture of allowing its engineers a virtually free hand in designing products is what drives the innovation and new features.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Sun's products really represent a level of innovation and quality that is hard to match. The cost of Sun's servers and storage products is often said to be too high, but in a like-for-like comparison, Sun products at the same price as those of the competition have better performance, features, power consumption, rack density, upgradability, investment protection, manageability and build quality. I know many of Sun's products well because of the way we work in the SSA (Sub-Saharan Africa) region: the engineers here support all of Sun's products, bar none.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
A little bit more about that: the engineers in this region must handle calls, i.e. analyze, determine the cause of a fault, and often implement the solution (though this may change, as a split in the team has been proposed). The products we support include software and hardware, everything from NAS gateways to cluster software, and take in many products that are virtually unknown outside their niche markets.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
What enables us to support such a wide variety of products is the quality of the products: They work as documented, and the documentation is available. In addition we have access to the engineering teams and interest groups for discussing unusual problems.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
What I am trying to get at is that I have personally dealt with a wide variety of Sun's products, and everywhere I look I see supportability through enterprise-level maturity. People who bash the products have had a single bad experience, and unfortunately it is human nature to base opinions on bad experiences and not to notice when things go well.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Sun's product line includes StorageTek's tape libraries and VTL/VSM mainframe products. It includes the Fujitsu-based M9000-64. It includes the Constellation blades and switches, little servers like the T2000, and even many smaller, though older, V210s and V240s that are still being used. In January I had to replace an EEPROM chip on an Ultra 5 on a scientific vessel in the Cape Town harbor!
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Sun also has a major investment in SPARC processors. No, they are not the fastest number crunchers; in fact the UltraSPARC processors are quite slow compared to the fastest processors from the competition. But scaling to hundreds of CPUs is mature technology in the SPARC camp. Multi-core CPU technology: mature. Multi-terabyte RAM in a system: mature. NUMA? Mature. 64-bit processors? Those came out, what, 12 years ago. Systems with over 700 GB/sec internal bus bandwidth? We got it. Adding memory or CPUs to a running system? Mature, and available for 10 years already. You can even remove those components and repair or replace them without stopping the OS, though it does require that you configure the server correctly.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
IBM in comparison has a slightly wider footprint in the server hardware arena: their systems include laptops and mainframes, and essentially everything in between. But their software product set is not as broad as that of Sun, which includes products like the Lustre file system, SunRay software, Java, StarOffice, OpenSolaris, and so on.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
What happens when Oracle suddenly attaches all of Sun's products to its own portfolio ... ?
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
In the enterprise market there are really only two databases: DB2 and Oracle. I am well aware that PostgreSQL, MySQL and YourFavouriteDB also have a meaningful place in the market, but those are all to a large extent "alternatives to" using Oracle or DB2.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
At present Oracle is not a huge competitor to IBM, except for the Database itself. But when Oracle suddenly adds all of Sun's products to its arsenal, it turns into a different beast. Oracle becomes the second "IBM".
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
I must say that Oracle is unlikely to kill off many products. They may sell a few, and I think it would be interesting to see what goes. Funding for some products may stop, but that need not mean the end of the Open Source products.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
One of my biggest gripes with Sun's uninformed critics is how, in the same forum post, they complain that Oracle will kill off their favourite open source application, and right after that complain about how Sun's products aren't open enough.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Listen to me: the mere fact that you worry about OpenOffice or MySQL or whatever disappearing means you admit what a large contribution Sun is making to your world.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
I don't know exactly what the merger will mean for the IT industry as a whole. If Oracle changes the Sun culture and stops the engineers from being innovative, we will see some of the competition in the market disappear over time. If Oracle sells off some products, they may get new life in another stable, or they may disappear; we don't know. But every product that does disappear will be a sad case and will be mourned.
</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=2><I>
If you don't think Sun's products are good, it just shows how little you know. I hope that Oracle will give new life and funding to Sun's R&D. If the Sun culture disappears, I will blame it on bad marketing that failed to turn good products into income, which ultimately resulted in Sun's demise ... but rather than make any premature judgement calls, let's wait and see what happens. Oh, and yes, I really do love Sun's products, especially Solaris and the servers.
</I></FONT></P>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3643404517377030815.post-63907743056499012102009-04-22T17:26:00.002+02:002009-04-22T17:47:07.994+02:00Voting day in the South African National Election, 2009<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
People who know me know that I am strongly apolitical. The problems we are facing in South Africa, especially the living conditions affecting most South Africans, will not be solved by any political party. They simply are not motivated or even enabled to fix things in a satisfactory manner. In short, I believe that it makes no difference who becomes our next Government.
</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
However, as in every one of the past five national elections (since I became eligible to vote), I today cast my vote for an opposition party, because I also believe in multi-party government: no one party should be allowed to rule unchecked and uncontested. In much the same way that competition is necessary in any industry, opposition serves as a check to keep the ruling party honest: an unopposed incumbent gets a de facto carte blanche, even in a democracy. Very few people are disciplined enough to remain altruistic throughout a dictatorship, if ever they were - I certainly don't believe any of the people on our existing political radar have this ability, regardless of their good intentions.
</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=2><I>
Strangely this touches on the Sun/Oracle merger, which many fear will kill off a number of products (and thus the competition) that do not fit in with Oracle's current business model. I'll post on that tomorrow after reading some of the Oracle employees' views on the matter.
</I></FONT></P>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3643404517377030815.post-53586197986788901072008-11-07T12:49:00.004+02:002008-11-07T13:00:24.104+02:00Neat way to prevent multiple instances of a script<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=2><I>
Sometimes you need to ensure that no more than one instance of a script can run at the same time. This is especially important with scripts that modify files, and it becomes more critical if the script runs for longer than a fraction of a second. Also, if many people administer a system, it becomes more important to ensure they don't step on each other's toes.
</I></FONT></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Basically this is known as the multiple-writers problem, and it is solved by something called "semaphores". A semaphore is a facility implemented in the kernel for the purpose of guaranteeing that a piece of code can be made "mutually exclusive". I won't go into the details here, as they are already properly explained on many websites, but have a look at the <a href="http://en.wikipedia.org/wiki/Semaphore_(programming)">Wikipedia article</a> if you are interested in the topic.
</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
One technique to get around the problem of multiple instances of a script is to use the existence of a specific file to signal to other instances that one is already running. The "touch" command does not complain about existing files, so you need to check first whether the file already exists, exit if it does, and create it otherwise. For example:
</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace" SIZE=2><PRE>
if [ -f /tmp/already_running ]
then
    echo "Can not continue - the lock file already exists!"
    echo "If you are sure that no other instance of this script"
    echo "is running, delete the file /tmp/already_running and try again."
    exit 1
else
    touch /tmp/already_running
fi
....
</PRE></FONT></P>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Near the end of the script it is then common to delete the file, ready for the next run. This, however, is not the best solution.
</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
If two instances of the script were started at nearly the same instant, instance 1 could check for the file, find that it does not exist, but then get kicked off the CPU so that instance 2 can run. Instance 2 then checks the lock file, also sees it is OK to continue, and creates the lock. Instance 1 eventually gets CPU time again and, having already checked the lock file, believes it is safe to continue running.
</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
This is known as a <a href="http://en.wikipedia.org/wiki/Race_condition">race condition</a>, and is by definition what semaphores are meant to prevent. But semaphores are not easily accessible from scripts - or so it might seem.
</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Now I know you are asking "but what is the chance of such precise timing of scheduled CPU time causing this kind of race condition?" Yes, the chances are probably low, especially if you are the only person using a specific script. However, there is a proper way of ensuring that we are the only instance running, and it is even simpler to implement than the check-file-touch-file method!
</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
The secret lies in the mkdir command.
</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace" SIZE=2><PRE>
if ! mkdir /tmp/already_running
then
    echo "Can not continue - the lock directory already exists!"
    echo "If you are sure that no other instance of this script"
    echo "is running, remove the directory /tmp/already_running and try again."
    exit 1
fi
....
</PRE></FONT></P>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
mkdir creates the directory entry atomically: the kernel guarantees that only one of several simultaneous mkdir calls for the same name can succeed. It fails if the directory already exists, and on successful return the lock is already in place, so no extra commands are needed to complete the mutex locking process.
</P>
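<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
A quick way to convince yourself of this is the following minimal sketch (the lock path is my own choice, and the $$ merely keeps simultaneous demo runs from colliding):
</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace" SIZE=2><PRE>
LOCK=/tmp/mkdir_demo_lock.$$
# First attempt succeeds and creates the lock in one atomic step
if mkdir "$LOCK"; then
    echo "first mkdir: lock acquired"
fi
# Second attempt fails because the directory now exists
if ! mkdir "$LOCK" 2>/dev/null; then
    echo "second mkdir: already locked"
fi
rmdir "$LOCK"
</PRE></FONT></P>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Running this prints both messages in order, showing that the same mkdir call serves as both the test and the lock acquisition in a single step.
</P>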
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
The second part of this is to automate the release of the lock when the script exits. Typically you want the lock to be released whether someone kills the script, presses Ctrl-C, or the script terminates normally or on an error.
</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
This is done by means of an EXIT trap. The format for using traps in Bourne shell variants is:
</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace" SIZE=2><PRE>
trap "do-something-here" EXIT
</PRE></FONT></P>
</TD></TR></TABLE>
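<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
For example, this little sketch (the messages are arbitrary) shows the trap firing on a normal exit:
</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace" SIZE=2><PRE>
trap 'echo "lock released"' EXIT
echo "doing work"
# When the script ends, the trap fires, so "lock released"
# is printed after "doing work"
</PRE></FONT></P>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
The trap also fires on Ctrl-C or an ordinary kill, which is exactly the behavior we want for releasing a lock.
</P>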
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
This trap must be set AFTER obtaining the lock; otherwise a second instance of the script would inadvertently remove the lock held by the first instance, because when the second instance exits after failing to obtain the lock, its trap would remove a lock it never owned.
</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
You obviously don't need the if-then-fi to print a message to the user - if you are the only person using a script, you can simplify the checking of the lock as follows:
</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace" SIZE=2><PRE>
MUTEX_LOCK=/tmp/myscript_already_running
mkdir $MUTEX_LOCK || exit 1
trap "rmdir $MUTEX_LOCK" EXIT
</PRE></FONT></P>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
With the above you will simply get an error message from mkdir, which you need to interpret as "the script is already running", e.g.:
</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace" SIZE=2><PRE>
mkdir: Failed to make directory "/tmp/myscript_already_running"; File exists
</PRE></FONT></P>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Using this technique, a whole script might look like this:
</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace" SIZE=2><PRE>
#!/bin/ksh
# This is myscript v1.0.
# Set up the running environment
MUTEX_LOCK=/tmp/myscript_already_running
...
if ! mkdir $MUTEX_LOCK
then
    echo "Can not continue - the lock directory already exists!"
    echo "If you are sure that no other instance of this script"
    echo "is running, remove the directory $MUTEX_LOCK and try again."
    exit 1
fi
trap "rmdir $MUTEX_LOCK" EXIT
...
# Work which requires only one instance of the script to be running
...
# THE END
</PRE></FONT></P>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Note there is no "remove lock" statement at the end of the script. This is handled by the trap, which executes on any exit - except, of course, a kill -9.
</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=2><I>
A kill -9 should in any case only ever be used as a last resort, because it does not allow the program to clean up after itself.
</I></FONT></P>Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-3643404517377030815.post-360193361482823432008-10-21T22:00:00.003+02:002008-10-21T22:17:32.540+02:00Why X-windows is back-to-front<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=2><I>
This is a very basic introduction to the X-windows protocol, with the purpose of explaining why the server runs on the workstation and the client runs on the server.
</I></FONT></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Actually the naming is the right way round. <b>By definition, clients initiate connections and servers accept connections</b>. Without trying to be pedantic or philosophical about the difference between a server and a workstation, let's just say that for the purpose of this discussion your desktop machine is the workstation, and the server is some system in your computer room serving applications, files, printers, etc. When you start an X-windows application on the server, it needs to open a window <em>somewhere</em>. In this case, that <em>somewhere</em> is the screen of your workstation. The X-windows application (really the X-client) will open a TCP/IP connection to port 6000 on your workstation, and via this session it will send "instructions" to the X-server on how to draw its window.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Clearly something needs to be running on your workstation to accept a connection on port 6000. This piece of software is called the <em>X-server</em>, and such programs are available for most if not all operating systems. In particular it is available for MS Windows and Mac OS, and Linux and Solaris include X-server software in the form of XFree86, Xorg and Xsun.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
The X-server actually listens on port 6000+N, where N is the "display" or "instance" number. Thus the first server is display 0, and listens on TCP/IP port 6000. A second display would listen on port 6000+1 = 6001, and so on, though having more than one is not particularly common.
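</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
The arithmetic is easy to verify from a shell. This sketch (the variable names are my own) extracts the display number from a DISPLAY-style value and computes the corresponding port:
</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace" SIZE=2><PRE>
# "localhost:1.0" means display 1, screen 0 on the local host
DISPLAY_VALUE="localhost:1.0"
num=${DISPLAY_VALUE#*:}      # strip the host part   -> "1.0"
num=${num%%.*}               # strip the screen part -> "1"
port=$((6000 + num))
echo "Display $num listens on TCP port $port"
</PRE></FONT></P>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
This prints "Display 1 listens on TCP port 6001", matching the rule above.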
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
There are a few things which seemingly complicate this matter. The first is that in many situations the X-server and the X-client run on the same system. Essentially this is true for all X applications used on a "local workstation". But the rule still holds - port 6000 listens and accepts connections from the clients running locally. X-clients, which can be anything from Firefox to Gnome-terminal, get a DISPLAY environment variable which tells them to connect to "localhost:0.0".
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
The second factor is how the application is started. It is possible to walk to the server, enter the command to set the DISPLAY variable (to point to the IP address of your workstation), then start the X-client application; when you return to your workstation you will find it showing the X-application (assuming that the X-server on your desktop is running and accepting connections). However it is much more convenient, and more common, to log into the server remotely, set the DISPLAY variable, and then start the X-client.
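</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Either way, the commands on the server look something like this (192.0.2.10 is a placeholder for your workstation's IP address - substitute your own):
</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace" SIZE=2><PRE>
DISPLAY=192.0.2.10:0.0    # the workstation's X-server, display 0
export DISPLAY
xterm &                   # any X-client; its window appears on the workstation
</PRE></FONT></P>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
This assumes the X-server on the workstation is running and accepting connections on port 6000.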
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
If you use a strategy like the above often, it quickly becomes too much effort, and a way to automate the setting of the DISPLAY environment variable becomes necessary. Many X-server programs include a few "automation" settings, which often include connecting to the remote server via rsh, telnet or ssh, logging in, setting the DISPLAY variable, and then starting an X-client program such as a terminal.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
In addition, SSH has some special features whereby it can tunnel the X-windows traffic back to your PC over the encrypted session. This is particularly handy when you have a firewall blocking incoming connections on port 6000. When you start your SSH client this way, it asks the SSH server to listen on port 6010, as if it were an X-server for display number 10. The SSH server does this (if it is configured to allow this kind of connection) and then starts the shell, cleverly setting the DISPLAY value to localhost:10.0. Note that "localhost" here refers to the server itself, so connections to port 6010 will be picked up by the SSH daemon on that server. When such a connection is made, e.g. by starting an X-client from within the SSH session, the SSH daemon on the server accepts it, knows that the specific port is associated with your SSH tunnel, and forwards the X-windows data back to your workstation via the existing SSH tunnel. The SSH client on your workstation distinguishes the X-windows data from other data on the channel, makes a connection to the real X-server on the workstation, and forwards the decrypted packets to the local X-server.
</P><P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
So in summary: the X-server runs on the workstation, where the display is rendered. The X-client establishes the TCP/IP session to this X-server based on the value of the DISPLAY setting. In normal situations you will observe the connection as incoming, back towards yourself. We usually associate the "server" half of a client-server application with the software running on the remote machine (relative to where we are sitting), but in this case it is obviously back-to-front.
</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=2><I>
Note: this brief introduction glosses over many complex issues and details: desktop environments, display managers, compositing, direct rendering, the driver model, connections via protocols other than TCP/IP, etc. For a slightly more complete though still very digestible introduction, see the Wikipedia article on the <a href="http://en.wikipedia.org/wiki/X_Window_System">X Window System</a>. For all the information you can handle, head over to the official steering group web site at <a href="http://x.org">http://x.org</a>
</I></FONT></P>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3643404517377030815.post-63813623704567301292008-10-18T18:42:00.010+02:002008-10-18T19:41:08.730+02:00Making the most of Solaris Man pages<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=2><I>
Solaris man pages (manual pages) are well written, consistent, complete, and generally a great source of information. Here are a few tips to help you get the most out of them. Of course this applies to all Solaris derivatives, including proper Solaris, OpenSolaris and other OpenSolaris distributions like Solaris Express, Belenix and Nexenta. Much of it even applies to other Unix flavors like BSD and AIX, and Unix derivatives like Linux.
</I></FONT></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
The simplest (and most obvious) way to use the manual pages is to enter <FONT FACE="Courier New, monospace" SIZE=2>"man command"</FONT> and then read the entire page.
</P><P STYLE="margin-bottom: 0in"><BR></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
My first hint however is to change the <FONT FACE="Courier New, monospace" SIZE=2>"PAGER"</FONT> environment variable.
I suggest you set it like this in your .profile file:
</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace"><FONT SIZE=2><PRE>
<B>PAGER="less -iMsq"; export PAGER</B><BR>
</PRE></FONT></FONT></P>
</TD></TR></TABLE>
</P><P STYLE="margin-bottom: 0in"><BR></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
(or for csh and its family members, use <FONT FACE="Courier New, monospace" SIZE=2>setenv PAGER "less -iMsq"</FONT>)
</P><P STYLE="margin-bottom: 0in"><BR></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
The reasons for this are</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">a) "less" supports more useful options than does "more". In particular, it supports highlighting, as well as all of the below!</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">b) The "-i" causes less to ignore case in searches. A definite advantage because you can simply enter /user and it will find the string "user" even when capitalized for use at the start of a sentence.</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">c) The -M causes "less" to show a more verbose status at the bottom of the screen.</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">d) -q ... Stop irritating me.</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">e) -s to "squeeze" blank lines (because otherwise less will "format" pages with blank spaces, as if it is a printed page)</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">f) You can return to the first line (top) of the page by typing "1" followed by "G"</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">g) less is more.</P>
</P><P STYLE="margin-bottom: 0in"><BR></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
The <FONT FACE="Courier New, monospace" SIZE=2>PAGER</FONT> environment variable is automatically used by the man command, so you can now test it and will find that you can scroll up and down in man pages. To find a word in the man page, tap the forward-slash / followed by the string to search for, and the word wherever it is found will be highlighted in inverse text.
</P><P STYLE="margin-bottom: 0in"><BR></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
<I>Note: less does have some limitations over more, but these do not apply to viewing man pages. For the curious, the limitation is in the handling of control characters in the input file. In particular, more does a better job of displaying "captured" sessions recorded from a system console or from a shell using the "script" command. Regardless, I solve these by reading capture files using <FONT FACE="Courier New, monospace" SIZE=2>cat -nvet | less -iMq</FONT> (Note - I like seeing line numbers)</I>
</P><P STYLE="margin-bottom: 0in"><BR></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Second hint: generate the manual pages' keyword index database (the so-called windex or apropos information). To do this, run the following command once:
</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace"><FONT SIZE=2><PRE>
$ <B>catman -w<BR></B>
</PRE></FONT></FONT></P>
</TD></TR></TABLE>
<P STYLE="margin-bottom: 0in"><BR></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
This will run for a while as it generates and stores the man page keyword index for future use.
</P><P STYLE="margin-bottom: 0in"><BR></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Once it completes, the man command's -k option and the "whatis" command will work. This lets you find what you need much more easily. For example, to find all man pages for tools, commands and drivers related to WiFi and wireless networking, you can use
</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace"><FONT SIZE=2><PRE>
$ <B>man -k wifi wireless<BR></B>
</PRE></FONT></FONT></P>
</TD></TR></TABLE>
<P STYLE="margin-bottom: 0in"><BR></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Third hint: some man pages are not automatically searched or found. In older (proper) Solaris releases, the man command has a "default list" of locations where it finds man pages. In recent OpenSolaris releases, man constructs a list of locations to search. The man manual page states:
</P><P STYLE="margin-bottom: 0in"><BR></P>
<pre>
Search Path
Before searching for a given name, man constructs a list of
candidate directories and sections. man searches for name in
the directories specified by the MANPATH environment vari-
able.
In the absence of MANPATH, man constructs its search path
based upon the PATH environment variable, primarily by sub-
stituting man for the last component of the PATH element.
Special provisions are added to account for unique charac-
teristics of directories such as /sbin, /usr/ucb,
/usr/xpg4/bin, and others. If the file argument contains a /
character, the dirname portion of the argument is used in
place of PATH elements to construct the search path.
</pre>
<P STYLE="margin-bottom: 0in"><BR></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Note: The (currently apparently undocumented) -p option to man will show the effective man search path.
</P><P STYLE="margin-bottom: 0in"><BR></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
As implied, you can set in MANPATH a list of directories containing man pages you use regularly. It is also possible to override the man search path using the man command's -M option. This is particularly useful for packages that install under /usr/local, /opt/SUNW...., etc. So to view the man page for "explorer" you would run:
</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace"><FONT SIZE=2><PRE>
man -M /opt/SUNWexplo/man explorer
</PRE></FONT></FONT></P>
</TD></TR></TABLE>
</P><P STYLE="margin-bottom: 0in"><BR></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
(Assuming you have SUNWexplorer installed)
</P><P STYLE="margin-bottom: 0in"><BR></P>
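<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
If you use such locations regularly, you can instead set MANPATH once in your .profile - a sketch, with illustrative paths:
</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace" SIZE=2><PRE>
# Add the package man directory alongside the standard location
MANPATH=/usr/share/man:/opt/SUNWexplo/man
export MANPATH
</PRE></FONT></P>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Note that once MANPATH is set, man no longer constructs its search path from PATH as described in the man page excerpt above, so be sure to include the standard locations as well.
</P>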
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Finally, it helps to understand the structure of a man page. The individual manual pages are divided up into a common set of sections, and knowing them you will often skip straight to a specific (sub)section in the page when looking for specific information.
</P><P STYLE="margin-bottom: 0in"><BR></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Some of these subsections are optional, but the names used are consistent. Here is a quick summary of some important sections.
</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLSPACING=0>
<TR><TD>NAME</TD><TD>What this page is about. This is the "whatis" information.</TD></TR>
<TR><TD>SYNOPSIS</TD><TD>The summary, eg how to use a command.</TD></TR>
<TR><TD>DESCRIPTION</TD><TD>A more complete description of what this component is.</TD></TR>
<TR><TD>OPTIONS, OPERANDS and USAGE</TD><TD>These sections detail how a command is used, e.g. what command-line options are available for a specific command.</TD></TR>
<TR><TD>ENVIRONMENT VARIABLES</TD><TD>As the name suggests, this section details environment variables which affect the behavior of the command.</TD></TR>
<TR><TD>EXIT STATUS</TD><TD>As the name implies, it explains the possible exit status values. It is useful to know that this section exists.</TD></TR>
<TR><TD>FILES</TD><TD>A list of files which are relevant to the command or subsystem. In particular here you will find out where a program keeps its configuration files. For a good example, see man dumpadm.</TD></TR>
<TR><TD>EXAMPLES</TD><TD>Often a great place to skip to when a manual page is complicated.</TD></TR>
<TR><TD>ATTRIBUTES</TD><TD>An often overlooked section, it explains the status of the item. For commands it tells you which packages the files are in. This section of the man pages is fully explained in the "attributes" manual page.</TD></TR>
<TR><TD>SEE ALSO</TD><TD>One of the most useful parts of man pages, this gives you hints about other pages which are related to the same topic.</TD></TR>
<TR><TD>NOTES</TD><TD>Various additional notes.</TD></TR>
</TABLE>
<P STYLE="margin-bottom: 0in"><BR></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Other sections, such as SECURITY, ERRORS, SUBCOMMANDS, etc., exist in some man pages, but the above are probably the most useful.
</P><P STYLE="margin-bottom: 0in"><BR></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
The entire collection of man pages is divided into manual sections, such as section 1 - user commands, section 1M - maintenance commands for sysadmins, section 3C - the C programming reference, and section 7 - information about device drivers. This is not hugely important, except that you need to understand that some pages occur in more than one section - for example "read" or "signal".
By default, man will display the first manual page which matches the name. If you want to see one of the other pages, you need to specify the manual section explicitly.
</P><P STYLE="margin-bottom: 0in"><BR></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
For example
</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace"><FONT SIZE=2>
$ <B>man -f signal</B><BR><PRE>
signal signal (3c) - simplified signal management for application processes
signal signal (3ucb) - simplified software signal facilities
signal signal.h (3head) - base signals
</PRE></FONT></FONT></P>
</TD></TR></TABLE>
</P><P STYLE="margin-bottom: 0in"><BR></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
We can see that "signal" has manual entries in sections 3c (the C programming reference), 3ucb (the UCB/BSD compatibility version), and 3head (the headers section).
</P><P STYLE="margin-bottom: 0in"><BR></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
Simply entering <FONT FACE="Courier New, monospace" SIZE=2>man signal</FONT> will show the man page from section "3C". To see the man page in the "3head" section, enter either
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace"><FONT SIZE=2><PRE>
$ <B>man -s 3head signal</B><BR>
</PRE></FONT></FONT></P>
</TD></TR></TABLE>
or
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace"><FONT SIZE=2><PRE>
$ <B>man signal.3head</B><BR>
</PRE></FONT></FONT></P>
</TD></TR></TABLE>
</P><P STYLE="margin-bottom: 0in"><BR></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
I often end up using the second form, simply because after viewing a man page I often spot a related page in its SEE ALSO section. I then exit the man page, recall the previous command, and append .section to the command line - so it is just a matter of convenience.
</P><P STYLE="margin-bottom: 0in"><BR></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
One last tip: There are online versions of the manual pages at <a href="http://docs.sun.com">http://docs.sun.com</a>.
</P><P STYLE="margin-bottom: 0in"><BR></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=2><I>
You don't need to learn every command and every option by heart. Knowing that you have manual pages, that you can quickly look up related commands via SEE ALSO and the keyword index using man -k, and that you can quickly search through a man page for the right section or keyword absolutely WILL make you a smarter, more efficient and overall better sysadmin!
</I></FONT></P>Unknownnoreply@blogger.com2tag:blogger.com,1999:blog-3643404517377030815.post-48069069186504920762008-07-29T20:28:00.008+02:002008-07-29T22:11:13.724+02:00How Solaris disk device names work<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=2><I>Writing this turned out to be surprisingly difficult as the article kept on growing too long. I tried to be complete but also concise to make this a useful introduction to how Solaris uses disks and on how device naming works.</I></FONT></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">So to start at the start: Solaris builds a device tree which is persistent across reboots and even across configuration changes. Once a device is found at a specific position on the system bus, an entry is created for the device instance in the device tree, and an instance number is allocated. Predictably, the first instance of a device is zero (e.g. e1000g0), and subsequent instances of devices using the same driver get allocated instance numbers incrementally (e1000g1, e1000g2, etc). The allocated instance number is registered, together with the device driver and the path to the physical device, in the /etc/path_to_inst file.</P>
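<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Each line in /etc/path_to_inst holds the quoted physical device path, the instance number, and the quoted driver name. Illustrative entries (the physical paths vary from machine to machine) look like this:</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<TR><TD VALIGN=TOP>
<P><FONT FACE="Courier New, monospace" SIZE=2><PRE>
"/pci@0,0/pci8086,100e@2" 0 "e1000g"
"/pci@0,0/pci1000,30@10/sd@0,0" 0 "sd"
</PRE></FONT></P>
</TD></TR></TABLE>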
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">This specific feature of Solaris is very important in providing stable, predictable behavior across reboots and hardware changes. For disk controllers this is critical as system bootability depends on it!</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">With Linux, the first disk in the system is known as /dev/sda, even if it happens to be on the second controller, or to have a target number other than zero on that controller. A new disk added on the first controller, or on the same controller but with a lower target number, causes the existing disk to move to /dev/sdb, and the new disk then becomes /dev/sda. This used to break systems, causing them to become non-bootable, and was a general headache. Some methods of dealing with this exist, using unique disk identifiers, device paths based on /dev/disk/by-path, etc.</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">If a Solaris system is initially configured with all disks attached to the second controller, the devices will get names starting with c1. Disks added to the first controller later on will get names starting with c0, and the existing disk device names will remain unaffected. If a new controller is added to the system, it will get a new instance number, e.g. c2, and existing disk device names will again remain unaffected.</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Solaris, however, composes disk device names (device aliases) from parts which identify the controller, the target-id, the LUN-id, and finally the slice or partition on the disk.</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">I will use some examples to explain this. Looking at this device:</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<TR><TD VALIGN=TOP>
<P><FONT FACE="Courier New, monospace"><FONT SIZE=2>
$ <B>ls -lL /dev/dsk/c1t* </B><BR>
<PRE>br-------- 1 root sys 27, 16 Jun 2 16:26 /dev/dsk/c1t0d0p0
br-------- 1 root sys 27, 17 Jun 2 16:26 /dev/dsk/c1t0d0p1
br-------- 1 root sys 27, 18 Jun 2 16:26 /dev/dsk/c1t0d0p2
br-------- 1 root sys 27, 19 Jun 2 16:26 /dev/dsk/c1t0d0p3
br-------- 1 root sys 27, 20 Jun 2 16:26 /dev/dsk/c1t0d0p4
br-------- 1 root sys 27, 0 Jun 2 16:26 /dev/dsk/c1t0d0s0
br-------- 1 root sys 27, 1 Jun 2 16:26 /dev/dsk/c1t0d0s1
br-------- 1 root sys 27, 10 Jun 2 16:26 /dev/dsk/c1t0d0s10
br-------- 1 root sys 27, 11 Jun 2 16:26 /dev/dsk/c1t0d0s11
br-------- 1 root sys 27, 12 Jun 2 16:26 /dev/dsk/c1t0d0s12
br-------- 1 root sys 27, 13 Jun 2 16:26 /dev/dsk/c1t0d0s13
br-------- 1 root sys 27, 14 Jun 2 16:26 /dev/dsk/c1t0d0s14
br-------- 1 root sys 27, 15 Jun 2 16:26 /dev/dsk/c1t0d0s15
br-------- 1 root sys 27, 2 Jun 2 16:26 /dev/dsk/c1t0d0s2
br-------- 1 root sys 27, 3 Jun 2 16:26 /dev/dsk/c1t0d0s3
br-------- 1 root sys 27, 4 Jun 2 16:26 /dev/dsk/c1t0d0s4
br-------- 1 root sys 27, 5 Jun 2 16:26 /dev/dsk/c1t0d0s5
br-------- 1 root sys 27, 6 Jun 2 16:26 /dev/dsk/c1t0d0s6
br-------- 1 root sys 27, 7 Jun 2 16:26 /dev/dsk/c1t0d0s7
br-------- 1 root sys 27, 8 Jun 2 16:26 /dev/dsk/c1t0d0s8
br-------- 1 root sys 27, 9 Jun 2 16:26 /dev/dsk/c1t0d0s9
</PRE></FONT></FONT></P>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">We notice the following:</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">1. The entries exist as links under /dev/dsk, pointing to the device node files in the /devices tree. In fact every device also has a second entry under /dev/rdsk. The ones under /dev/dsk are "block" devices, used in a random-access manner, e.g. for mounting file systems. The "raw" device links under /dev/rdsk are character devices, used for low-level access functions (such as creating a new file system).</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">2. The device names all start with c1, indicating controller c1 - so basically all the entries above are on one controller.</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">3. The next part of the device name is the target-id, indicated by t0. This is determined by the SCSI target-id number set on the device, and not by the order in which disks are discovered. Any new disk added to this controller will have a new unique SCSI target number and so will not affect existing device names.</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">4. After the target number each disk has a LUN-id number, d0 in the example. This too is determined by the SCSI LUN-id provided by the device. Normal disks on a simple SCSI card all show up as LUN-id 0, but devices like arrays or JBODs can present multiple LUNs on a target. (In such devices the target usually indicates the port number on the enclosure.)</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">5. Finally each device identifies a partition or slice on the disk. Devices with names ending with a p# indicate a PC BIOS disk partition (sometimes called an fdisk or primary partition), and names ending with an s# indicate a Solaris slice.</P>
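<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">The naming scheme is regular enough that the parts can be extracted mechanically. As a small sketch (plain POSIX shell and sed, nothing Solaris-specific), here is one way to split a device alias such as c1t0d0s2 into its components:</P>

```shell
# Split a Solaris disk device alias into controller, target, LUN and slice.
dev=c1t0d0s2
ctrl=$(echo "$dev"  | sed -n 's/^c\([0-9]*\)t.*/\1/p')
targ=$(echo "$dev"  | sed -n 's/^c[0-9]*t\([0-9]*\)d.*/\1/p')
lun=$(echo "$dev"   | sed -n 's/^c[0-9]*t[0-9]*d\([0-9]*\)[ps].*/\1/p')
slice=$(echo "$dev" | sed -n 's/^.*s\([0-9]*\)$/\1/p')
echo "controller=$ctrl target=$targ lun=$lun slice=$slice"
# prints: controller=1 target=0 lun=0 slice=2
```

For an x86 ATA disk name such as c0d0p1 the t part is simply absent, and the trailing p1 names a BIOS partition rather than a slice.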
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">This calls for some more explanation. There are five device names ending with p0 through p4. The p0 device, e.g. c1t0d0p0, indicates the whole disk as seen by the BIOS. The c_t_d_p1 device is the first primary partition, c_t_d_p2 the second, etc. These devices represent all four of the allowable primary partitions, and always exist even when the partitions are not in use.</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">In addition there are 16 devices with names ending with s0 through s15. These are Solaris "disk slices", and originate from the way disks are "partitioned" on SPARC systems. Essentially Solaris uses slices much like PCs use partitions - most Solaris disk admin utilities work with disk slices, not with fdisk or BIOS partitions.</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">The way the "disk" is sliced is stored in the Solaris VTOC, which resides in the first sector of the "disk". In the case of x86 systems, the VTOC exists inside one of the primary partitions, and in fact most disk utilities treat the Solaris partition as the actual disk. Solaris splits that particular partition up into "slices", so the aforementioned "disk slices" really refer to slices existing inside a partition.</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Note that Solaris disk slices are often called disk partitions, so the two can easily be confused - when documentation refers to partitions you need to make sure you understand whether PC BIOS partitions or Solaris slices are implied. In general, if the documentation applies to SPARC hardware (as well as to x86 hardware), then "partitions" means Solaris slices (SPARC does not have an equivalent of the PC BIOS partition concept).</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Example Disk Layout:</P>
<FONT FACE="Courier New, monospace"><FONT SIZE=2>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<TR><TD VALIGN=TOP>
p1</TD><TD>First primary Partition</TD>
</TR><TR>
<TD>p2</TD><TD>Second primary Partition</TD>
</TR><TR>
<TD>
p3</TD>
<TD><TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<TR><TD>Solaris Type 0xBF or 0x80 Partition</TD></TR>
<TR><TD>s0</TD><TD>Slice commonly used for root</TD></TR>
<TR><TD>s1</TD><TD>Slice commonly used for swap</TD></TR>
<TR><TD>s2</TD><TD>Whole disk (backup or overlap slice)</TD></TR>
<TR><TD>s3</TD><TD>Custom use slice</TD></TR>
<TR><TD>s4</TD><TD>Custom use slice</TD></TR>
<TR><TD>s5</TD><TD>Custom use slice</TD></TR>
<TR><TD>s6</TD><TD>Custom use slice, commonly /export</TD></TR>
<TR><TD>s7</TD><TD>Custom use slice</TD></TR>
<TR><TD>s8</TD><TD>Boot block</TD></TR>
<TR><TD>s9</TD><TD>Alternates (2 cylinders)</TD></TR>
<TR><TD>s10</TD><TD>x86 custom use slice</TD></TR>
<TR><TD>s11</TD><TD>x86 custom use slice</TD></TR>
<TR><TD>s12</TD><TD>x86 custom use slice</TD></TR>
<TR><TD>s13</TD><TD>x86 custom use slice</TD></TR>
<TR><TD>s14</TD><TD>x86 custom use slice</TD></TR>
<TR><TD>s15</TD><TD>x86 custom use slice</TD></TR></TABLE>
</TD></TR>
<TR><TD>
p4</TD>
<TD><TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<TR><TD>Extended partition</TD></TR>
<TR><TD>p5</TD><TD>Example: Linux or data partition</TD></TR>
<TR><TD>p6</TD><TD>Example: Linux or data partition</TD></TR>
<TR><TD>etc</TD><TD>Example: Linux or data partition</TD></TR></TABLE>
</TD>
</TD></TR></TABLE></FONT></FONT>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Note that traditionally slice 2 "overlaps" the whole disk, and is commonly referred to as the backup slice, or slightly less commonly, called the overlap slice.</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">The ability to have slice numbers from 8 to 15 is x86 specific. By default slice 8 covers the area on the disk where the label, VTOC and boot record are stored. Slice 9 covers the area where the "alternates" data is stored - a two-cylinder area used to record information about relocated/errored sectors.</P>
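<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">To get a feel for the size of that area, assume for illustration a common x86 disk geometry of 255 heads and 63 sectors per track with 512-byte sectors (an assumption for the example only - real geometries vary):</P>

```shell
# Size of a two-cylinder area: cylinders * heads * sectors/track * bytes/sector.
heads=255
spt=63
sector_bytes=512
alt_bytes=$((2 * heads * spt * sector_bytes))
echo "$alt_bytes"    # prints: 16450560
```

So on such a geometry the two-cylinder alternates area comes to just under 16 MB.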
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Another example of disk device entries:</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace"><FONT SIZE=2>
$ <B>ls -lL /dev/dsk/c0*</B><BR>
<PRE>brw-r----- 1 root sys 102, 16 Jul 14 19:45 /dev/dsk/c0d0p0
brw-r----- 1 root sys 102, 17 Jul 14 19:45 /dev/dsk/c0d0p1
brw-r----- 1 root sys 102, 18 Jul 14 19:45 /dev/dsk/c0d0p2
brw-r----- 1 root sys 102, 19 Jul 14 19:12 /dev/dsk/c0d0p3
brw-r----- 1 root sys 102, 20 Jul 14 19:45 /dev/dsk/c0d0p4
brw-r----- 1 root sys 102, 0 Jul 14 19:45 /dev/dsk/c0d0s0
brw-r----- 1 root sys 102, 1 Jul 14 19:45 /dev/dsk/c0d0s1
...
brw-r----- 1 root sys 102, 8 Jul 14 19:45 /dev/dsk/c0d0s8
brw-r----- 1 root sys 102, 9 Jul 14 19:45 /dev/dsk/c0d0s9
</PRE></FONT></FONT></P>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">The above example is taken from an x86 system. Note the lack of a target number in the device names. This is particular to ATA hard drives on x86 systems. Apart from that, these work like the normal device names described above.</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Below, comparing the block and raw device entries:</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace"><FONT SIZE=2>
$ <B> ls -l /dev/*dsk/c1t0d0p0</B><BR>
<PRE>lrwxrwxrwx 1 root root 49 Jun 26 16:22 /dev/dsk/c1t0d0p0 -> ../../devices/pci@0,0/pci-ide@1f,2/ide@1/sd@0,0:q
lrwxrwxrwx 1 root root 53 Jun 2 16:18 /dev/rdsk/c1t0d0p0 -> ../../devices/pci@0,0/pci-ide@1f,2/ide@1/sd@0,0:q,raw
</PRE></FONT></FONT></P>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">These look the same, except that the second one points to the raw device node.</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">For completeness' sake, some utilities used in managing disks:</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<TR><TD VALIGN=TOP>
<TR><TD><B>format</B></TD><TD> The work-horse, used to perform partitioning (including fdisk partitioning on x86 based systems), analyzing/testing the disk media for defects, tuning advanced SCSI parameters, and generally checking the status and health of disks.</TD></TR>
<TR><TD><B>rmformat</B></TD><TD> Shows information about removable devices, formats media, etc.</TD></TR>
<TR><TD><B>prtvtoc</B></TD><TD> Command-line utility to display information about disk geometry and more importantly, the contents of the VTOC in a human readable format, showing the layout of the Solaris slices on the disk.</TD></TR>
<TR><TD><B>fmthard</B></TD><TD> Write or overwrite a VTOC on a disk. Its input format is compatible with the output produced by prtvtoc, so it is possible to copy the VTOC between two disks by means of a command like this:
<P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<TR><TD VALIGN=TOP>
<P><FONT FACE="Courier New, monospace"><FONT SIZE=2><B>
prtvtoc /dev/rdsk/c1t0d0s2 | fmthard -s - /dev/rdsk/c1t1d0s2
</B></FONT></FONT></P>
</TD></TR></TABLE>
</P><P>
This is obviously not meaningful if the second disk does not have enough space. If the disks are of different sizes, you can use something like this:</P><P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<TR><TD VALIGN=TOP>
<FONT FACE="Courier New, monospace"><FONT SIZE=2><B>
prtvtoc /dev/rdsk/c1t0d0s2 | awk '$1 != 2' | fmthard -s - /dev/rdsk/c1t1d0s2
</B></FONT></FONT>
</TD></TR></TABLE>
</P><P>
The above awk command causes the entry for slice 2 to be omitted; fmthard will then maintain the existing slice 2 entry on the target disk or, if none exists, create a default one.
</P><P>
Also note, as implied above, Solaris slices can (and often do) overlap. Care needs to be taken to not have file systems on slices which overlap other slices.</P></TD></TR>
</TD></TR>
<TR><TD><B>iostat -En</B></TD><TD> Show "error" information about disks, and - often very useful - the firmware revisions and manufacturer's identifier strings.</TD></TR>
<TR><TD><B>format -e</B></TD><TD> This reveals expert-level functionality, such as the cache options on SCSI disks.</TD></TR>
<TR><TD><B>format -Mm</B></TD><TD><P>Enable debugging output; in particular, this makes SCSI probe failures non-silent.</P>
<P>cfgadm and luxadm also deserve honorable mention here. These commands manage disk enclosures, detaching and attaching devices, etc., but are also used to manage some aspects of disks.</P></TD></TR>
<TR><TD><B>luxadm -e port</B></TD><TD><P>Show list of FC HBAs.</P>
<P>luxadm can also for example be used to set the beacon LED on individual disks in FCAL enclosures that support this function. The details are somewhat specific to the relevant enclosure.</P>
<P>cfgadm can be used to probe SAN connected subsystems, eg by doing:</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<TR><TD VALIGN=TOP>
<P><FONT FACE="Courier New, monospace"><FONT SIZE=2><B>
cfgadm -c configure c2::XXXXXXXXXXXX
</B></FONT></FONT></P>
</TD></TR></TABLE>
(where XXXXXXXXXXX is the enclosure port WWN, using controller c2)
</P>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=2><I>Hopefully this gives you an idea about how disk device names, controller names, and partitions and slices all relate to one another.</I></FONT></P>Unknownnoreply@blogger.com13tag:blogger.com,1999:blog-3643404517377030815.post-37719175919514647162008-07-16T20:31:00.004+02:002008-07-16T20:47:24.411+02:00Reading and Writing ISO images using Solaris<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=2><I>After my <A HREF="http://initialprogramload.blogspot.com/2008/07/short-guide-to-solaris-loop-back-file.html">recent post on mounting ISO image files</A> I thought I should write a quick article on the other ways of using these files: reading a disk into a file and burning a file to a disk. This is not a complete guide on the topic by a long shot, but if you just want the quick-start answer, it is here.</I></FONT></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">If you have an ISO-9660 CD (or DVD) image file that you want to burn to a disk, you simply use this command:</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace"><FONT SIZE=2>
# <B>cdrw -i filename.iso</B>
</FONT></FONT></P>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">This will write the file named filename.iso to the default CD writer device. For DVD media the session is closed (using disk-at-once writing), while for CD media track-at-once writing is used.</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">To create an ISO image from a disk, use this command:</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace"><FONT SIZE=2>
# <B>readcd dev=/dev/rdsk/c1t0d0s2 f=filename.iso speed=1 retries=20</B>
</FONT></FONT></P>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">readcd needs at least the device and the file to be specified. To discover the device, you can use the command "iostat -En" and look for the Writer device, or you can let readcd scan for a device, using a command like this:</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace"><FONT SIZE=2>
# <B>readcd -scanbus</B><BR>
<PRE>scsibus1:
1,0,0 100) 'MATSHITA' 'DVD-RAM UJ-841S ' '1.40' Removable CD-ROM
1,1,0 101) *
1,2,0 102) *
1,3,0 103) *
1,4,0 104) *
1,5,0 105) *
1,6,0 106) *
1,7,0 107) *
</PRE></FONT></FONT></P>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">The device 1,0,0 can be used directly, or you can convert it to the Solaris naming convention as I did in the example above.</P>
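<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">That conversion is mechanical if you assume (as is typical, but not guaranteed) that the scanbus number matches the controller instance: the bus, target and lun fields map onto the c, t and d parts of the Solaris name. A quick sketch in plain shell:</P>

```shell
# Map a readcd scanbus triple "bus,target,lun" onto a Solaris cNtNdN alias.
# Assumes the scsibus number corresponds to the controller instance.
triple="1,0,0"
bus=${triple%%,*}      # text before the first comma
rest=${triple#*,}      # text after the first comma
target=${rest%%,*}
lun=${rest#*,}
echo "c${bus}t${target}d${lun}s2"    # prints: c1t0d0s2
```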
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=2><I>There are of course other ways of doing it, feel free to comment and tell me about your favourite method for reading to or burning from ISO-image files.</I></FONT></P>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3643404517377030815.post-54591602880374558962008-07-09T12:47:00.002+02:002008-07-09T12:54:52.208+02:00A short guide to the Solaris Loop-back file systems and mounting ISO images<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=2><I>The Solaris Loop-back file system is a handy bit of software, allowing you to "mount" directories, files and, in particular, CD or DVD image files in ISO-9660 format.</I></FONT></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">To make it more user friendly, build 91 of ONV introduces the ability for the mount command to automatically create the loop-back devices for ISO images! The <A HREF="http://dlc.sun.com/osol/on/downloads/b91/on-changelog-b91.html">Changelog for NV 91</A> has got the following note:</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">
<BLOCKQUOTE>
Issues Resolved:
PSARC case 2008/290 : lofi mount
BUG/RFE: 6384817 Need persistent lofi based mounts and direct mount(1m) support for lofi
</BLOCKQUOTE></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">In older releases, it was necessary to run two commands to mount an ISO image file. The first to set up a virtual device for the ISO image:</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace"><FONT SIZE=2>
# <B>lofiadm -a /shared/Downloads/image.iso</B><BR>
/dev/lofi/1
</FONT></FONT></P>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">And then to mount it somewhere:</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace"><FONT SIZE=2>
# <B>mount -F hsfs -o ro /dev/lofi/1 /mnt</B>
</FONT></FONT></P>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Solaris uses hsfs to indicate the "High Sierra File System" driver used to mount ISO-9660 files. Specify "-o ro" to make it read-only, though that is the default for hsfs file systems, at least lately (I seem to recall that at one point in the past it was mandatory to specify read-only mounting explicitly).</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Looking at what has been happening here, we can see the Loop-back device by running lofiadm without any options:</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace"><FONT SIZE=2>
# <B>lofiadm</B><BR>
<PRE>Block Device File Options
/dev/lofi/1 /shared/Downloads/image.iso -
</PRE></FONT></FONT></P>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">And the mounted file system:</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace"><FONT SIZE=2>
# <B>df -k /mnt</B><BR>
<PRE>Filesystem kbytes used avail capacity Mounted on
/dev/lofi/1 2915052 2915052 0 100% /mnt
</PRE></FONT></FONT></P>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">The new feature of the mount command requires a full path to the ISO file (just like lofiadm does, at any rate for now):</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace"><FONT SIZE=2>
# <B>mount -F hsfs -o ro /shared/Downloads/image2.iso /mnt</B>
</FONT></FONT></P>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">To check the status:</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace"><FONT SIZE=2>
# <B>df -k /mnt</B><BR>
<PRE>Filesystem kbytes used avail capacity Mounted on
/shared/Downloads/image2.iso
7781882 7781882 0 100% /mnt
</PRE></FONT></FONT></P>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">And when we run lofiadm we see it automatically created a new device, /dev/lofi/2:</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace"><FONT SIZE=2>
# <B>lofiadm</B><BR>
<PRE>Block Device File Options
/dev/lofi/1 /shared/Downloads/image.iso -
/dev/lofi/2 /shared/Downloads/image2.iso -
</PRE></FONT></FONT></P>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Some of the other uses of the Loop-back file system:</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">You can mount any directory on any other directory:</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace"><FONT SIZE=2>
# <B>mkdir /mnt2</B><BR>
# <B>mount -F lofs -o ro /usr/spool/print /mnt2</B>
</FONT></FONT></P>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Note the use of lofs as the file system "type". This is a bit like a hard-link to a directory, and it can exist across file systems. These can be read-write or read-only.</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">You can also mount any individual file onto another file:</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=5 CELLSPACING=0>
<COL WIDTH=256*><TR><TD WIDTH=100% VALIGN=TOP>
<P><FONT FACE="Courier New, monospace"><FONT SIZE=2>
# <B>mkdir /tmp/mnt</B><BR>
# <B>echo foobar > /tmp/mnt/X</B><BR>
# <B>mount -F lofs /usr/bin/ls /tmp/mnt/X</B><BR>
# <B>ls -l /tmp/mnt</B><BR>
<PRE>total 67
-r-xr-xr-x 1 root bin 33396 Jun 16 05:43 X
</PRE># <B>cd /tmp/mnt</B><BR>
# <B>./X</B><BR>
X<BR>
# <B>./X -l</B><BR>
<PRE>total 67
-r-xr-xr-x 1 root bin 33396 Jun 16 05:43 X
</PRE></FONT></FONT></P>
</TD></TR></TABLE>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">The above feature incidentally inspired item nr 10 on <A HREF="http://initialprogramload.blogspot.com/2008/07/zfs-missing-features.html">my ZFS feature wish list</A>.</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=2><I>This allows for a lot of flexibility. Indeed, this functionality is central to how file systems and disk space are provisioned in Solaris Zones. If you play around with it you will find plenty of uses for it!</I></FONT></P>
<BR>
<BR>
<BR>
<BR>Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-3643404517377030815.post-63242103463518160992008-07-07T10:33:00.003+02:002008-12-12T00:10:39.798+02:00Some days I'm glad that I'm not a network administrator<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg7zAaXxgQiXJ4qOC-IbnYC48GxCylVspNYucjBaE-UUVWCyCVqsX0oZMu_SpYC_5ax8af5jU20UTwf9rV5Frq3S0nmS7f5j05b5Qfba3CXW2RCHh_YKr7BfOOUWN7xEeVLfCUW8AcAppo/s1600-h/cablemess.jpg"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg7zAaXxgQiXJ4qOC-IbnYC48GxCylVspNYucjBaE-UUVWCyCVqsX0oZMu_SpYC_5ax8af5jU20UTwf9rV5Frq3S0nmS7f5j05b5Qfba3CXW2RCHh_YKr7BfOOUWN7xEeVLfCUW8AcAppo/s400/cablemess.jpg" alt="" id="BLOGGER_PHOTO_ID_5220188128542292850" border="0" /></a>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3643404517377030815.post-75818466379519519562008-07-06T21:36:00.002+02:002008-07-06T22:29:52.818+02:00ZFS missing features<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=2><I>What would truly make ZFS be <A HREF="http://www.google.co.za/url?sa=t&ct=res&cd=3&url=http%3A%2F%2Fopensolaris.org%2Fos%2Fcommunity%2Fzfs%2Fdocs%2Fzfs_last.pdf&ei=FB5xSOjzA5K20gXMjbDrAQ&usg=AFQjCNFOVVNn0Eq7naUzqCThJL6dLQc3mw&sig2=j7cs1eCH1XZJLmKgNZM9Rw">The Last Word in File Systems (PDF)</A>?</I></FONT></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Why every feature of course! Here is my wishlist!</P>
<OL><LI>
Nested vdevs (eg Raid 1+Z)
</LI><LI>
Hierarchical Storage Management (migrate rarely used files to cheaper/slower vdevs)
</LI><LI>
Traditional Unix quotas (i.e. for when you have multiple users owning files in the same directories spread out across a file system)
</LI><LI>
A way to convert a directory on a ZFS file system into a new ZFS file system, and the corresponding reverse function to merge a directory back into its parent (because the admin made some wrong decision)
</LI><LI>
Backup function supporting partial restores. In fact partial backups should be possible too, e.g. backing up any directory or file list, not necessarily only at the file system level. And restores which do not require the file system to be unmounted / re-mounted.
</LI><LI>
Re-layout of pools (to accommodate adding disks to a raidz, converting a non-redundant pool to raidz, removing disks from a pool, etc). (Yes, I'm aware of some work in this regard.)
</LI><LI>
Built-in Multi-pathing capabilities (with automatic/intelligent detection of active paths to devices), eg integrated MPxIO functionality. I'm guessing this is not there yet because people may want to use MPxIO for other devices not under ZFS control and this will create situations where there are redundant layers of multipathing logic.
</LI><LI>
True global file system functionality (multiple hosts accessing the same LUNs and mounting the same file systems with parallel write). Or even just a sharezfs option (like sharenfs, but allowing the client to access ZFS features, e.g. to set ZFS properties, create datasets, snapshots, etc., similar in functionality to what is possible when granting a zone ownership of a ZFS dataset).
</LI><LI>
While we're at it: in-place conversion from, e.g., UFS to ZFS.
</LI><LI>
The ability to snapshot a single file in a ZFS file system (so that you can effect per-file version tracking)
</LI><LI>
An option on the zpool create command to take a list of disks and automatically set up a layout, intelligently taking into consideration the number of disks and the number of controllers, allowing the user to select from a set of profiles determining optimization for performance, space or redundancy.</LI></OL>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=2><I>So... what would it take to see ZFS as the new default file system on, for example USB thumb drives, memory cards for digital cameras and cell phones, etc? In fact, can't we use ZFS for RAM management too (snapshot system memory)?</I></FONT></P>
<BR>
<BR>
<BR>Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-3643404517377030815.post-49283344262608569642008-07-05T20:57:00.000+02:002008-07-05T21:00:03.713+02:00The Pupil will surpass the tutor<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in"><FONT SIZE=2><I>Linux is an attempt at making a free clone of Unix. Initially it aimed to be Unix compatible, though I feel that goal has become less and less important as Linux grew in maturity.</I></FONT></P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Now all of a sudden we have a complete turn-about as the big Unices want to be like Linux! Linux is attractive for a variety of reasons, including a fast, well refined kernel, lots of readily available and free applications, good support and, because of these, a growing and loyal following. The utilities available with most Linux distributions are based on the core utilities found in the big Unices, plus a large collection of new additions, all working together in a more or less coherent way to build a usable platform.</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Nowadays many new Unix administrators have at least some Linux experience, and with this background can be easily frustrated when looking for Linux-specific utilities (where is top in Solaris?). End users would like to see the applications they used on Linux run on Unix. And the ability to run Linux on cheap and cheerful PC hardware does not detract from Linux's popularity by any means.</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">So Sun Microsystems, just like IBM with AIX, finds itself looking at Linux to see what this platform is doing right to make it successful. To me this, more than anything else, is proof that Linux has finally grown up.</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">I expect that a leap-frog game will emerge between Linux and Unix, particularly Solaris, with the two competing on innovative features to be the platform of choice for both datacenter and desktop applications.</P>
<P ALIGN=JUSTIFY STYLE="margin-bottom: 0in">Congratulations to the Linux community on a job well done.</P>
<BR>
<BR>
<BR>
<BR>Unknownnoreply@blogger.com0