Saturday, March 3, 2012

Sun said it first

In each case it takes the x86-based market years to figure out truths that Sun Microsystems has been stating all along. The latest example is here; I quote from the second paragraph:
There's a category of server applications that can be better served by a lower class of good enough computing, delivering much better power efficiency. Content web servers, similar to what we use at AnandTech, don't present a hugely complex workload but they do see lots of threads and have largely variable performance requirements. SeaMicro's technology reduces power consumption by using lower power CPUs and highly power optimized motherboards.
Doesn't that sound just a little too similar to what Sun have been saying forever? I wonder whether anybody else anywhere did as much innovation as Sun did. What a failure to market - almost every cellphone comes out with Java, yet almost nobody knows who created it. That, and Sun's habit of giving everything away for free, is why it exists no more. Good products alone do not make a company successful - the company needs to be able to turn those products into a profit.

Wednesday, March 30, 2011

Maintaining the Linux device driver code base

After a (sadly) failed attempt to convert my significant other to Linux, I had a discussion with her about why it failed. Root cause.

Her computer works well with Windows, not at all with Linux. The reason is that her laptop will display no better than 800x600 resolution as there is no good SIS671 graphics driver for Linux (and there is for Windows). Nothing recent, functional, supported, viable or workable.

Why isn't there one for Linux?

Because it doesn't make money. Business is the process of converting time into money. Sales people get customers to buy a product or service. Technical people produce the products or deliver the services. Management and administration functions support and enable the business to operate as a whole. (Or so the theory goes, but that is another story.)

And because programmers also need a place to live. And to feed the kids.

There is cost involved – an investment, and there is a price, the return on the investment. A product, in this case a device driver for the graphics processor, needs to be designed, produced and supported. The technical people and the tools they need to do this do not come cheap.

Device drivers for Linux, however, do not make much money, if any, for the companies involved. People do not pay for device drivers; rather, they (rightly) expect them to be included in the cost of the hardware.

Even closed source Linux drivers are free – the vendor has to cover the costs through the sales of hardware. But the business model is flawed – the cost to deliver the Linux device driver far exceeds the income generated from hardware sales to Linux users. Thus this expense must be subsidised from sales to Windows users.

Unless the Linux user base grows to reach a critical mass, the point where enough Linux users buy the hardware to be able to justify the cost of the driver development and support, the situation will not change.

The above situation is the same, no, actually worse for other hardware – webcams, GPS units, cell phones, USB thumb drives, Bluetooth hardware, Wi-Fi and network cards, fingerprint readers and touch-pad input devices. Every single bit of hardware.

The Linux kernel includes almost all device drivers for the hardware because of this situation. It is the only way the Linux community can use most of the consumer hardware available in the world today – that is, by developing the needed device drivers themselves.

As a result Linux supports much, much more hardware out of the box than Windows does. Windows depends on the driver disks that ship with the hardware because Microsoft does not provide driver software for every bit of hardware out there!

The more you think about it, the more you realise just how unbalanced the situation really is! Microsoft sells its Windows operating system with only basic device drivers included – for proper functionality, features and performance, you need to load the hardware manufacturer's drivers. The hardware manufacturers provide the device drivers because otherwise they would lose the majority of their market – Windows users.

The Linux community, an entity that makes no money, needs to provide device drivers created through donated effort. I am aware of the exceptions, but that does not change the overall picture. The effort to maintain and update the base of device drivers included in the Linux kernel increases as the number of pieces of hardware to be supported increases. In other words: Every time a new piece of hardware appears in the shops.

To add insult to injury, the Linux community locks itself in with the GPL license, which means it cannot, for example, utilise and share effort by other Unix or BSD distributions, because the Linux kernel enforces the use of the restrictive GPL.

Even worse, a Linux device driver works only on a specific release of the kernel. This is because the kernel interfaces for device drivers change, and as a result the device driver needs to be re-compiled for every update, even minor updates, to the kernel. The amount of extra work this would place on hardware manufacturers to ensure that their device driver works on every kernel version is significant, and much more than what is needed for, for example, Windows or Solaris.

The long and the short of it is that to produce and maintain device drivers for Linux is prohibitively expensive, and the market loss as a result of not supporting Linux users is essentially negligible to most hardware manufacturers' bottom line!

Regarding the market share situation: I have long held the belief that through “allowing” us to copy Windows, Bill Gates got the world using MS Windows. It is what most people grew up with on our computers at home, and what we as a result expected when we entered the workplace. More than just the majority of the work force, today's computer gamer is tomorrow's IT business decision maker.

But there is some light on the horizon: The Wayland Display Server may just give the Linux graphics stack the performance boost it needs to make it a viable gaming platform, which in turn will gain it the adoption of many gamers, and in the long run more market share on the desktop. Now if only Linus would fix the device driver ABIs and APIs to make it that bit easier for hardware manufacturers to support their device driver software on Linux...

There is a lot of FUD on the net about how the "deliberately dynamic ABIs" of the Linux kernel make Linux drivers better maintained, less buggy, etc. Sigh.

Thursday, February 17, 2011

Live Upgrade to install the recommended patch cluster on a ZFS snapshot

Live Upgrade used to require that you find some free slices (partitions) and then fidget with the -R "alternate Root" options to install the patch cluster to an Alternate Boot Environment (ABE). With ZFS all of those pains have just ... gone away ...

Nowadays Live Upgrade on ZFS doesn't even copy the installation; instead it automatically clones a snapshot of the boot environment, saving much time and disk space! Even the patch install script is geared towards patching an Alternate Boot Environment!

The patching process involves six steps:

  1. Apply Pre-requisite patches
  2. Create an Alternate Boot Environment
  3. Apply the patch cluster to this ABE
  4. Activate the ABE
  5. Reboot
  6. Cleanup

Note: The system remains online throughout all except the reboot step.
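The steps above can be collected into one script. What follows is only a sketch, assuming the cluster was extracted to /patches/10_x86_Recommended and using the boot environment names from this post. The run helper and the DRY_RUN variable are my own illustrative names, not part of Live Upgrade. By default it only prints what it would do, and it is no substitute for reading the README.

```shell
#!/bin/sh
# Sketch only: prints the Live Upgrade patching sequence unless DRY_RUN=0.
# "run" and DRY_RUN are illustrative names, not Live Upgrade features.
DRY_RUN=${DRY_RUN:-1}
CLUSTER_DIR=/patches/10_x86_Recommended   # where the cluster was extracted
OLD_BE=s10u9                              # current boot environment
NEW_BE=s10u9patched                       # alternate boot environment

# Print the command in dry-run mode, otherwise execute it and stop on failure.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@" || { echo "FAILED: $*" >&2; exit 1; }
    fi
}

patch_sequence() {
    run cd "$CLUSTER_DIR"
    run ./installcluster --apply-prereq                # step 1
    run lucreate -c "$OLD_BE" -n "$NEW_BE" -p rpool    # step 2
    run ./installcluster -B "$NEW_BE"                  # step 3
    run luactivate "$NEW_BE"                           # step 4
    # Steps 5 and 6 are deliberately left to a human.
    echo "next: init 6, then verify before ludelete $OLD_BE"
}

patch_sequence
```

The reboot and the cleanup are left out on purpose: the reboot is the only step that takes the system offline, and the cleanup should only happen once you are sure the patched environment works.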

In preparation you uncompress the downloaded patch cluster file. I created a ZFS file system, mounted it on /patches, and extracted the cluster in there. Furthermore, you have to read the cluster README file - it contains a "password" needed to install, and information about pre-requisites and gotchas. Read the file. This is your job!

The pre-requisites are essentially just patches to the patch-add tools, conveniently included in the Patch Cluster!

Step 1 - Install the pre-requisites for applying the cluster to the ABE

# cd /patches/10_x86_Recommended
# ./installcluster --apply-prereq

Note - If you get an Error due to insufficient space in /var/run, see my previous blog post here!

Step 2 - Create an Alternate boot environment (ABE)

# lucreate -c s10u9 -n s10u9patched -p rpool

Checking GRUB menu...
Analyzing system configuration.
No name for current boot environment.
Current boot environment is named <s10u9>.
Creating initial configuration for primary boot environment <s10u9>.
The device </dev/dsk/c1t0d0s0> is not a root device for any boot environment; cannot get BE ID.
PBE configuration successful: PBE name <s10u9> PBE Boot Device </dev/dsk/c1t0d0s0>.
Comparing source boot environment <s10u9> file systems with the file
system(s) you specified for the new boot environment. Determining which
file systems should be in the new boot environment.
Updating boot environment description database on all BEs.
Updating system configuration files.
Creating configuration for boot environment <s10u9patched>.
Source boot environment is <s10u9>.
Creating boot environment <s10u9patched>.
Cloning file systems from boot environment <s10u9> to create boot environment <s10u9patched>.
Creating snapshot for <rpool/ROOT/s10_0910> on <rpool/ROOT/s10_0910@s10u9patched>.
Creating clone for <rpool/ROOT/s10_0910@s10u9patched> on <rpool/ROOT/s10u9patched>.
Setting canmount=noauto for </> in zone <global> on <rpool/ROOT/s10u9patched>.
Saving existing file </boot/grub/menu.lst> in top level dataset for BE <s10u9patched> as <mount-point>//boot/grub/menu.lst.prev.
File </boot/grub/menu.lst> propagation successful
Copied GRUB menu from PBE to ABE
No entry for BE <s10u9patched> in GRUB menu
Population of boot environment <s10u9patched> successful.
Creation of boot environment <s10u9patched> successful.

There is now an extra boot environment to which we can apply the Patch Cluster. Note - for what it is worth, if you just needed a test environment to play in, you can now luactivate the alternate boot environment and then make any changes to the active system. If the system breaks, all it takes to undo any and all changes is a reboot.

Step 3 - Apply the patch cluster to the BE named s10u9patched.

# cd /patches/10_x86_Recommended
# ./installcluster -B s10u9patched

I am not showing the long and boring output from the installcluster script as this blog post is already far too long. The patching runs for quite a while, so plan for at least two hours. Monitor the process and check the log for warnings. Depending on how long it has been since the last patches were applied, some severe patches may be applied which can affect your ability to log in after rebooting. Again: READ the README!

Step 4 - Activate the ABE.

# luactivate s10u9patched
System has findroot enabled GRUB
Generating boot-sign, partition and slice information for PBE <s10u9>
A Live Upgrade Sync operation will be performed on startup of boot environment <s10u9patched>.

Generating boot-sign for ABE <s10u9patched>
Generating partition and slice information for ABE <s10u9patched>
Copied boot menu from top level dataset.
Generating multiboot menu entries for PBE.
Generating multiboot menu entries for ABE.
Disabling splashimage
Re-enabling splashimage
No more bootadm entries. Deletion of bootadm entries is complete.
GRUB menu default setting is unaffected
Done eliding bootadm entries.

**********************************************************************

The target boot environment has been activated. It will be used when you
reboot. NOTE: You MUST NOT USE the reboot, halt, or uadmin commands. You
MUST USE either the init or the shutdown command when you reboot. If you
do not use either init or shutdown, the system will not boot using the
target BE.

**********************************************************************

In case of a failure while booting to the target BE, the following process
needs to be followed to fallback to the currently working boot environment:

1. Boot from the Solaris failsafe or boot in Single User mode from Solaris
Install CD or Network.

2. Mount the Parent boot environment root slice to some directory (like
/mnt). You can use the following commands in sequence to mount the BE:

     zpool import rpool
     zfs inherit -r mountpoint rpool/ROOT/s10_0910
     zfs set mountpoint=<mountpointName> rpool/ROOT/s10_0910
     zfs mount rpool/ROOT/s10_0910

3. Run <luactivate> utility with out any arguments from the Parent boot
environment root slice, as shown below:

     <mountpointName>/sbin/luactivate

4. luactivate, activates the previous working boot environment and
indicates the result.

5. Exit Single User mode and reboot the machine.

**********************************************************************

Modifying boot archive service
Propagating findroot GRUB for menu conversion.
File </etc/lu/installgrub.findroot> propagation successful
File </etc/lu/stage1.findroot> propagation successful
File </etc/lu/stage2.findroot> propagation successful
File </etc/lu/GRUB_capability> propagation successful
Deleting stale GRUB loader from all BEs.
File </etc/lu/installgrub.latest> deletion successful
File </etc/lu/stage1.latest> deletion successful
File </etc/lu/stage2.latest> deletion successful
Activation of boot environment <s10u9patched> successful.

# lustatus
Boot Environment           Is       Active Active    Can    Copy
Name                       Complete Now    On Reboot Delete Status
-------------------------- -------- ------ --------- ------ ----------
s10u9                      yes      no     no        yes    -
s10u9patched               yes      yes    yes       no     -

Carefully take note of the details on how to recover from a failure. Making a hard-copy of this is not a bad idea! Take note that you have to use either init or shutdown to effect the reboot, as the other commands will circumvent some of the delayed action scripts! Hence ...

Step 5 - Reboot using shutdown or init ...

# init 6

Monitor the boot-up sequence. A few handy commands while you are performing the upgrade include:

# lustatus
# bootadm list-menu
# zfs list -t all
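If you want to check the result from a script rather than by eye, the lustatus table is easy to parse. This is a minimal sketch, assuming the column layout shown in the lustatus output earlier in this post, where the fourth column is "Active On Reboot". The function name is my own.

```shell
# Print the name of the boot environment that will be active after the next
# reboot. Reads lustatus-style output on stdin; column 4 is "Active On Reboot".
# The header and separator lines never have "yes" in column 4, so they are skipped.
be_active_on_reboot() {
    awk '$4 == "yes" { print $1 }'
}

# Typical use on a live system (requires Live Upgrade):
#   lustatus | be_active_on_reboot
```

Before running init 6 this should print the name of the patched boot environment. If it prints the old one, the luactivate step did not take.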

You will eventually (after confirming that everything works as expected) want to free up the disk space held by the snapshots. The first command cleans up the redundant Live Upgrade entries as well as the relevant ZFS snapshot storage! The second is to remove the temporary ZFS file system used for the patching.

Step 6 - Cleanup

# ludelete s10u9
# zfs destroy rpool/patches

Again no worries about where the space comes from. ZFS simply manages it! Live Upgrade takes care of your GRUB boot menu and gives you clear instructions on how to recover if anything goes wrong.

Adding a ZFS zvol for extra swap space

ZFS sometimes truly takes the think work out of allocating and managing space on your file systems. But only sometimes.

Many operations on Solaris, OpenSolaris and Indiana will cause you to run into swap space issues. For example using the new Solaris 10 VirtualBox appliance, you will get the following message when you try to install the Recommended Patch Cluster:

Insufficient space available in /var/run to complete installation of this patch
set. On supported configurations, /var/run is a tmpfs filesystem resident in
swap. Additional free swap is required to proceed applying further patches. To
increase the available free swap, either add new storage resources to swap
pool, or reboot the system. This script may then be rerun to continue
installation of the patch set.

This is fixed easily enough by adding more swap space, like this:

# zfs create -V 1GB -b $(pagesize) rpool/swap2
# zfs set refreservation=1GB rpool/swap2
# swap -a /dev/zvol/dsk/rpool/swap2
# swap -l
swapfile                  dev   swaplo  blocks    free
/dev/zvol/dsk/rpool/swap  181,2      8 1048568 1048568
/dev/zvol/dsk/rpool/swap2 181,1      8 2097144 2097144

Setting the reservation is important, particularly if you plan on making the change permanent, e.g. by adding the new zvol as a swap entry in /etc/vfstab. Without it ZFS does not reserve the space for swapping, so the swap system may think there is space which isn't actually there.

The -b option sets the volblocksize to improve swap performance by aligning the volume I/O units on disk to the memory page size of the host architecture (4 KB on x86 systems and 8 KB on SPARC, as reported by the pagesize command).
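To sanity-check the sizes, remember that swap -l reports swaplo and blocks in 512-byte units. The helper below is a small sketch of that arithmetic; the function name is my own, and the $(( )) syntax needs ksh or bash rather than the legacy Bourne shell.

```shell
# Convert a block count from "swap -l" (512-byte blocks) to MiB.
swap_blocks_to_mib() {
    echo $(( $1 * 512 / 1048576 ))
}

# swaplo + blocks for the 1 GB zvol in the listing above
swap_blocks_to_mib $((8 + 2097144))   # prints 1024
```

The 8 blocks of swaplo are the first 4 KB of the device, which swap skips, so swaplo plus blocks gives the full volume size.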

If this is just temporary, then cleaning up afterwards is just as easy:

# swap -d /dev/zvol/dsk/rpool/swap2
# zfs destroy rpool/swap2

It is also possible to grow the existing swap volume. To do so, set a new size and refreservation for the existing volume like this:

# swap -d /dev/zvol/dsk/rpool/swap
# zfs set volsize=2g rpool/swap
# zfs set refreservation=2g rpool/swap
# swap -a /dev/zvol/dsk/rpool/swap

And finally, it is possible to do the above without removing and re-adding the swap device, by using the following "trick":

# zfs set volsize=2g rpool/swap
# zfs set refreservation=2g rpool/swap
# swap -l | awk '/rpool.swap/ {print $3+$4}'|read OFFSET
# env NOINUSE_CHECK=1 swap -a /dev/zvol/dsk/rpool/swap $OFFSET

The above will calculate the offset in the swap device and add a new "device" to the list of swap devices. This will automatically use the added space in the zvol. The offset will be shown as the "swaplo" value in swap -l output. Multiple swap devices on the same physical media are not ideal, but on the next reboot (or by deleting and re-adding the swap device) the system will recognise the full size of the volume.
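The offset calculation can also be written with command substitution, which does not depend on how your shell handles read at the end of a pipeline. This is a sketch only. The function name is my own, and it matches the device path exactly so that rpool/swap2 is not picked up by accident. It assumes the device is listed once, and on Solaris the -v option needs nawk or /usr/xpg4/bin/awk rather than the old /usr/bin/awk.

```shell
# Print swaplo + blocks for one swap device, reading "swap -l" output on stdin.
# Matching on the full device path avoids also matching rpool/swap2.
swap_next_offset() {
    awk -v dev="$1" '$1 == dev { print $3 + $4 }'
}

# Typical use on a live system:
#   OFFSET=$(swap -l | swap_next_offset /dev/zvol/dsk/rpool/swap)
#   env NOINUSE_CHECK=1 swap -a /dev/zvol/dsk/rpool/swap "$OFFSET"
```

With the swap -l listing shown earlier, the 512 MB rpool/swap device gives 8 + 1048568 = 1048576, which is where the newly added space begins.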

No worries about where the space comes from. ZFS just allocates it! The flip side of the coin is that once you have quotas, reservations, allocations, indirect allocations such as from snapshots, figuring out where your space has gone can become quite tricky! I'll blog about this some time!

Monday, December 6, 2010

Useless Performance Comparisons

The point of performance comparisons or benchmark articles has to be purely sensational. The vast majority of these appear to have little value other than attracting less educated readers to the relevant websites.

In a recent article Michael Larabel of Phoronix reports on the relative performance of various file systems under Linux, specifically comparing the traditional Linux file systems to the new (not yet quite available) native ZFS module. According to the article ZFS performs slower than the other Linux file systems in most of the tests, but I have a number of issues with both how the testing was done and with how the article was written.

Solaris 11 Express should have been included in the test, and the results for OpenIndiana should be shown for all tests. It is crucial that the report include other system metrics such as CPU utilization during the test runs.

I also have some even more serious gripes. In particular the blanket statement that some unspecified “subset” of the tests were performed on both a standard SATA hard drive and the SSD drive, but that the results were “proportionally” the same – does not make sense as some tests are more seek latency sensitive than others, and some file systems hide these latencies better than others.

Another serious gripe is that there is no feature comparison. More complex software has more work to do, and one would expect some trade-offs.

Even worse: two of ZFS’s strengths were eliminated by the way the testing was done. Firstly, when ZFS is given a “whole disk” as is recommended in the ZFS best practices (as opposed to being given just a partition) it will safely enable the disk’s write cache. It only does this if it knows that there are no other file systems on the disk, i.e. when ZFS is in control of the whole disk. Secondly, ZFS manages many disks very efficiently, particularly where allocating space is concerned: ZFS performance does not come into its own on a single disk system!

Importantly, and especially so since this is very much a beta version of a port of a mature and stable product, we need to understand which of ZFS's features are present, different and/or missing compared to the mature product. For example, two of ZFS’s biggest performance inhibitors under FUSE are that it is limited to a single-threaded ioctl (Ed: Apparently this is fixed in ZFS for Linux 0.6.9, but I am unable to tell whether this is the version Phoronix tested) and that it does not have access to the disk devices at a low level. The KQ Infotech website lists some missing features; particularly interesting is the missing Linux async I/O support. Furthermore the KQ Infotech FAQ states that Direct IO falls back to buffered read/write functions and that missing kernel APIs are being emulated through the "Solaris Porting Layer".

A quick search highlights some serious known issues, such as the Linux VFS Cache and ZFS ARC cache copy duplication bug, a bug which heavily impacts on performance.

More information about missing features can be found on the LLNL issue tracker page.

If nothing else, the article should mention the fact that there are known severe performance issues and feature incompleteness with the Linux native ZFS module! The way in which Linux allocates and manages virtual address space is inefficient (don't take my word for it, see this and this), requiring expensive workarounds.

Besides all of this, my real, main gripe is about this kind of article in general. The common practice of testing everything with “default installation settings” implies that nothing else needs to be done - however when you want the absolute best possible performance out of something, you need to tune it for the specific workload and conditions. In the case of the article in question, the statement reads “All file-systems were tested with their default mount options”, and no other information is given, such as whether the disk was partitioned, whether the different subject file systems were mounted at the same time, what disk the system was booted from and whether the operating system was running with part of the disk hosting the tested file system mounted as its root. We don’t even know whether the author read the ZFS Best Practices Guide.

It can be argued that the average person will not tune the system, or in this case the file system, for one specific workload because their workstation should be an all-round performer, but you should still comply with the best practices recommendations from the vendors, especially if performance is one of your main criteria.

I don’t know whether using defaults is ever acceptable in this kind of article. My issue stems from how these articles are written in a way that suggests that performance is the only or at least the most important factor in choosing an operating system, or file system, or graphics card, or CPU or whatever the subject is. If that were true then at least the system should be tuned to make the most of each of the subject candidates, whether these are hardware or software parts being tested and compared to one another. This tuning is often done by disabling features and configuring the relevant options, and to get it right you would usually need someone who is a performance expert on that piece of software or hardware to optimize it for each test.

Specific hardware (or software) often favors one or the other of the entrants. An optimized, feature poor system will outperform a complex, feature rich system on limited hardware. Making the best use of the available hardware might mean different implementation choices when optimizing for performance rather than for functionality or reliability. ZFS in particular comes into its own, both in terms of features and performance, when it has underneath it a lot of hardware – RAM, disk controllers, and as many dedicated disk drives as possible. The other file systems have likely reached their performance limit on the limited hardware on which the testing was done. Linux is particularly aimed at the non-NUMA, single-core, single hard drive, single user environment. Solaris and ZFS were developed in a company where single user workstations were an almost non-existent target, the real target of course being the large servers of the big corporates.

As documented in the ZFS Evil Tuning Guide, many tuning options exist. One could turn off ZFS check-sum calculations, limit the ARC cache sizes, set the SSD disk as a cache or log device for the SATA disk, and set the pool to cache only meta data, to mention a few. Looking at the hardware available in the Phoronix article, the choices would depend on the specific test – in another test one might stripe between the SATA disk and the SSD disk, in another you might choose to mirror across the two.

The other file system candidates might have different recommendations in terms of how to optimize for performance.

I realize that the functionality would be affected by such tuning, but the article doesn’t look at functionality, or even usability for that matter. ZFS provides good reliability and data integrity, but only in its default configuration, with data check-summing turned on. The data protection levels and usable space in each test might be different, but that again is a function of which features are used and not the subject of the article, not even mentioned anywhere in the article.

As a case in point for the argument about functionality, one needs to consider all that ZFS is doing in addition to being a POSIX compliant file system. It replaces the volume manager. It adds data integrity checking through check-summing. It manages space allocation, including space for file systems, meta-data, snapshots, and ZVOLs (virtual devices created out of a ZFS pool) automatically. Usage can be controlled by means of a complete set of reservation and quota options. Changing settings, such as turning on encryption, the number of copies of data to be kept, whether to do check-summing, etc., is dynamic. There is much more, as Google will tell.

And just to add insult to injury, the article goes and pits XFS against ZFS, ignoring the many severe reliability issues present with XFS, such as the often reported data corruption under heavy load and severe file system corruption when losing power.

I would really like to see a performance competition one day. The details of how the testing will be done will be given out in advance to allow the teams to research it. Each team is then given access to the same budget from which to buy and build their own system to enter into the competition. Their performance experts then set up and build the systems, and install the software and tune it for the tests on the specific hardware they have available. One team might buy a system with more CPUs while another might buy a system with more disks and SCSI controllers, but the test is fair (barring my observation about how feature poor systems will always perform better on a low-budget system) because each team solves the same problem with the same budget. The teams submit their ready systems to the judges, who run the performance test scripts, and publish their configuration details in a how-to guide. To eliminate cheating, an independent group will try to duplicate the team’s test results using the guide.

I think this would make a fun event for a LAN party – any sponsors interested?

Sunday, May 10, 2009

Lost Dog!

Otto, our Dachshund cross, got lost yesterday. If someone found such a dog in the area near the Stellenberg High School and did a Google search I hope that they will hit on this page. Otto is a little brown dog with really big ears. The dog is my son Francois' best friend, so it would be really terrible if we never were to find it again. For the record, we live in Amanda Glen, but the dog could easily walk into the Sonstraal Heights or Stellenberg area. If you saw this dog, please call Johan at 021 910 7160 or Reinette at 021 976 3453. Thank you!

Sunday, April 26, 2009

ZFS user quotas available in SNV build 114

I noted, as per Chris Gerhard's Weblog, that user and group quotas on ZFS will be available soon - the fix to bug ID 6501037 is currently slated for inclusion in ON build 114.

Once this becomes available I will have one fewer item on my list of features missing from ZFS.

Currently to limit users' consumption the workaround documented here is to provide each user with a dedicated directory on which another dataset is mounted and a quota is set. This implies that the user can only create or write to files in that specific directory. To track and limit a user's total usage across an entire ZFS pool requires User quotas - ditto for consumption by group.

According to this post by Matty, the feature is implemented in a way which enforces the rule "tardily", that is, enforcement is a little "late". The post also mentions that translated SIDs (e.g. when the directory is shared via SMB) are supported.

The PSARC/2009/204/ document here provides details of how the quotas are implemented. Two new zfs subcommands, namely zfs userspace and zfs groupspace, report the consumption, and control is by means of a set of new properties on ZFS file system datasets.
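As a rough sketch of what this looks like in use: the per-user property is named userquota@<user> (with groupquota@<group> as the group equivalent). The user alice, the 10G limit and the dataset rpool/export/home below are hypothetical example values, and the helper only prints the commands rather than running them, so check them against the zfs man page on your build first.

```shell
# Print (not run) the commands to cap one user's usage on a dataset.
# "alice", "10G" and "rpool/export/home" are hypothetical example values.
show_userquota_cmds() {
    user=$1
    size=$2
    dataset=$3
    echo "zfs set userquota@${user}=${size} ${dataset}"   # set the limit
    echo "zfs get userquota@${user} ${dataset}"           # read it back
    echo "zfs userspace ${dataset}"                       # per-user consumption
}

show_userquota_cmds alice 10G rpool/export/home
```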

This amounts to good news all around. Maybe I should start tracking bug IDs for all of the items on my feature wish-list!