Monday, June 30, 2008

Use Live Upgrade already

If you are still not using Live Upgrade, you need to make learning it a priority. It will save you hours and make your manager happy, because it costs nothing and gives you a simple, reliable and fast method for backing out your changes. You just need a few (about 10) GB of free disk space, be it in your root ZFS pool, on an unused disk, or even any slice on any disk in the system.

The Live Upgrade concept is simple: make a copy of your "boot environment", run the upgrade or patching against this copy (called the alternate boot environment), and finally activate it.

Creation of the new boot environment is done by running a few simple commands which copy and update the files in the new boot environment. This can (and does) take a considerable amount of time, but it runs in the background while the system is up and running, with all services online and active.

The Live Upgrade commands come from three packages that you should install from the target OS's install media. For example, if you want to upgrade from Solaris 9 to Solaris 10, you install SUNWlucfg, SUNWluu and SUNWlur from the Solaris 10 media (or run the liveupgrade20 install script in the Tools/Installers directory).

Once the copy has been upgraded or patched, another command (luactivate) is run to mark the new boot environment for activation on the next reboot. On SPARC systems this modifies the boot-device in the OBP, while on i386 systems it updates Grub with a new "default".
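
Putting it all together, a rough sketch of the whole cycle could look like this (the boot environment name and the media path are made-up examples; on a UFS root you would also pass -m to lucreate to tell it where the copy must go):

lucreate -n new_be                        # copy the current boot environment
luupgrade -u -n new_be -s /cdrom/cdrom0   # upgrade the copy from the install media
luactivate new_be                         # mark the copy for activation on the next boot
init 6                                    # reboot via init 6 (not reboot), so the final sync runs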

All that is left then is the actual reboot. During the reboot some special files and directories are synchronized one last time - this is because between the time the system was copied over to the clone and the time the reboot runs, various things can change: people still log in and change their passwords, receive and send mail, spool jobs to the printers, etc. The administrator could even create new login accounts! To deal with this, Live Upgrade synchronizes a pre-determined list of files and directories during the first boot of the new boot environment.

The list of files and directories to be synchronized lives in /etc/lu/synclist and can be customized by editing that file.
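
For reference, the synclist entries are simple path-plus-action pairs; a couple of the default entries look roughly like this (OVERWRITE replaces the file in the new boot environment, APPEND adds the new records to the existing file):

/var/mail           OVERWRITE
/var/adm/messages   APPEND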

Live Upgrade has the intelligence built in to let the new boot environment find the files in the old boot environment during the boot-up process, so this is completely automatic.

Recent Solaris Express installations prepare for the use of Live Upgrade by automatically setting up a slice and mounting it as "/second_root", but you need to unmount it and remove it from /etc/vfstab before Live Upgrade will allow you to use it. If you don't have a free slice, make one (back up /export, unmount it, and create two smaller slices in its place, one for Live Upgrade and one to restore /export to). This is still cheaper than performing upgrades the traditional way.
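
Once the slice is free, a minimal sketch of handing it to Live Upgrade could be (the boot environment name and slice are just examples; -m tells lucreate where the copy of the root file system should go):

lucreate -n nv_copy -m /:/dev/dsk/c0d0s4:ufs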

Thursday, June 19, 2008

Using a dedicated ZFS file system to manage process core dumps

ZFS just bristles with potential. Quotas, Reservations, turning compression or atime updates on or off without unmounting. The list goes on.

So now that we have ZFS root (since Nevada build snv_90, and even earlier when using OpenSolaris or other distributions), let's start to make use of these features.

First things first: on my computer I don't care about access-time updates on files or directories, so I disable them.

/ $ pfexec zfs set atime=off rpool

That is not particularly spectacular in itself, but since it is there I use it. The idea is of course to save a few disk updates and the corresponding IOs.

Next: core dumps. One of my pet hates. Many processes dump core in your home dir, where they get overwritten or forgotten, and then there are any number of core files lying around all over the file systems, all of them just wasting space since I don't really intend to try to analyze any of them.

Solaris has a great feature by which core dumps can all be directed to a single directory and, on top of that, given more meaningful file names.

So the idea is to create a directory, say /var/cores, and store the core files there for later review. But knowing myself, these files will just continue to waste space until I one day decide to actually troubleshoot a specific issue.

To me this sounds like a perfect job for ZFS.

First I check that there is not already something called /var/cores:

/ $ ls /var/cores
/var/cores: No such file or directory

Great. Now I create it.

/ $ pfexec zfs create rpool/export/cores
/ $ pfexec zfs set mountpoint=/var/cores rpool/export/cores

And set a limit on how much space it can ever consume:

/ $ pfexec zfs set quota=2g rpool/export/cores

Note: This can easily be changed at any time, simply by setting a new quota.
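
For example, to raise the limit to 4 GB later:

/ $ pfexec zfs set quota=4g rpool/export/cores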

Which gives the picture below:

/ $ df -h
Filesystem size used avail capacity Mounted on
rpool/ROOT/snv_91 20G 5.9G 7.0G 46% /
/devices 0K 0K 0K 0% /devices
/dev 0K 0K 0K 0% /dev
ctfs 0K 0K 0K 0% /system/contract
proc 0K 0K 0K 0% /proc
mnttab 0K 0K 0K 0% /etc/mnttab
swap 2.3G 416K 2.3G 1% /etc/svc/volatile
objfs 0K 0K 0K 0% /system/object
sharefs 0K 0K 0K 0% /etc/dfs/sharetab
/usr/lib/libc/libc_hwcap1.so.1
13G 5.9G 7.0G 46% /lib/libc.so.1
fd 0K 0K 0K 0% /dev/fd
swap 2.3G 7.2M 2.3G 1% /tmp
swap 2.3G 64K 2.3G 1% /var/run
rpool/export 20G 19K 7.0G 1% /export
rpool/export/home 20G 6.8M 7.0G 1% /export/home
rpool 20G 44K 7.0G 1% /rpool
rpool/export/cores 2.0G 18K 2.0G 1% /var/cores
SHARED 61G 24K 31G 1% /shared
... snip ...

And checking the settings on the /var/cores ZFS file system:

/ $ zfs get all rpool/export/cores
NAME PROPERTY VALUE SOURCE
rpool/export/cores type filesystem -
rpool/export/cores creation Thu Jun 19 14:18 2008 -
rpool/export/cores used 18K -
rpool/export/cores available 2.00G -
rpool/export/cores referenced 18K -
rpool/export/cores compressratio 1.00x -
rpool/export/cores mounted yes -
rpool/export/cores quota 2G local
rpool/export/cores reservation none default
rpool/export/cores recordsize 128K default
rpool/export/cores mountpoint /var/cores local
rpool/export/cores sharenfs off default
rpool/export/cores checksum on default
rpool/export/cores compression off default
rpool/export/cores atime off inherited from rpool
rpool/export/cores devices on default
rpool/export/cores exec on default
rpool/export/cores setuid on default
rpool/export/cores readonly off default
rpool/export/cores zoned off default
rpool/export/cores snapdir hidden default
rpool/export/cores aclmode groupmask default
rpool/export/cores aclinherit restricted default
rpool/export/cores canmount on default
rpool/export/cores shareiscsi off default
rpool/export/cores xattr on default
rpool/export/cores copies 1 default
rpool/export/cores version 3 -
rpool/export/cores utf8only off -
rpool/export/cores normalization none -
rpool/export/cores casesensitivity sensitive -
rpool/export/cores vscan off default
rpool/export/cores nbmand off default
rpool/export/cores sharesmb off default
rpool/export/cores refquota none default
rpool/export/cores refreservation none default

Note that access-time updates on this file system are off - the setting has been inherited from the pool. The only "local" settings are the mountpoint and the quota, which correspond to the items I specified manually.

Now I just need to make new core files actually use this directory. At present, the default settings from coreadm look like this:

/ $ coreadm
global core file pattern:
global core file content: default
init core file pattern: core
init core file content: default
global core dumps: disabled
per-process core dumps: enabled
global setid core dumps: disabled
per-process setid core dumps: disabled
global core dump logging: disabled

Looking at the coreadm man page, there is a fair amount of flexibility in what can be done. I want core files to have a name identifying the zone in which the process was running, the process executable file, and the user. I also don't want core dumps to overwrite one another when the same process keeps on faulting, so I will add a time stamp to the core file name.

/ $ pfexec coreadm -g /var/cores/core.%z.%f.%u.%t

Then I enable global core dumps (so that the pattern above actually takes effect), turn on logging of an event whenever a core file is generated, and also store core files for set-uid processes:

/ $ pfexec coreadm -e global -e global-setid -e log

And finally, just to review the core-dump settings, these now look like this:

/ $ coreadm
global core file pattern: /var/cores/core.%z.%f.%u.%t
global core file content: default
init core file pattern: core
init core file content: default
global core dumps: enabled
per-process core dumps: enabled
global setid core dumps: enabled
per-process setid core dumps: disabled
global core dump logging: enabled

Now if that is not useful, I don't know what is! Once you run your machine with a ZFS root, you will soon start to appreciate just how much space used to be wasted, and just how rigid and inflexible other file systems really are!




Saturday, June 14, 2008

Update: More on how to make x86 Solaris with Grub boot verbosely

Since I posted a while ago on how to make Solaris boot verbosely, I have found a better way. Or rather, I have learned a bit more about this.

Instead of just adding -v to the kernel line, add "-v -m verbose".

The "-m verbose" portion passes the verbose option to SMF, giving you verbose information about startup of services.

The "-v" causes the messages which normally goes to the system log to also be emitted on the console.

My grub entry for verbose booting now looks like this:

# Solaris SNV91 Verbose Boot
title Solaris SNV_91 Verbose Boot
findroot (BE_SNV_91,1,a)
kernel$ /platform/i86pc/kernel/$ISADIR/unix -v -m verbose -B $ZFS-BOOTFS
module$ /platform/i86pc/$ISADIR/boot_archive
# End Solaris SNV_91 Verbose


Of course you don't need the entry to be in grub permanently - if your system is not booting, use Grub's edit feature to add these options to the kernel line of the entry you want to boot from.

Also just a note on the splash image: removing it is entirely optional - not removing it will not hide any bootup messages (as I previously thought).

Not a day goes by that I don't learn something new about Solaris.


Wednesday, June 4, 2008

Sharing a ZFS pool between Linux and Solaris

If you are multi-booting between Linux and Solaris (and others like FreeBSD, OpenBSD and Mac OS X, I expect), you will sooner or later encounter the problem of how to share disk space between the operating systems. FAT32 is not satisfactory due to its lack of POSIX features, in particular file ownership and access modes, not to mention its subpar performance. ext2/3 is not an option because you only get read-only support for it in Solaris, and similarly UFS enjoys only read-only support in Linux. The whole situation is rather depressing.


Enter ZFS.


This all started because I discovered that I can have a ZFS root file system without having to install OpenSolaris. The trick, as some of you may know, is to select "Solaris Express" from the first menu on booting the install disk, and then select one of the two "Interactive Text" options from the next menu. This puts you back into 1984 in terms of installers, but you get the option of using ZFS for root!


Note: It might be possible to do this with the default installer, but on my computer that installer just would not run (I got some daft error about fonts and mouse themes). With a ZFS root, swap and dump automatically go onto dedicated ZFS volumes, and you save a lot in terms of pre-allocated space.


I have of course used ZFS on my laptop previously as a test, but the benefits were limited by the fact that I still had "slices" for the OS and a small ZFS pool on a spare slice.


I'm not sure which build of Nevada first introduced the ZFS root option in the installer, but it is available in build 90 at least.


My choice of Linux distribution is Ubuntu 8.04. The steps to setting up a ZFS pool shared across operating systems are as follows:


1. Select a Partitioning scheme with minimal space allocated to each of Ubuntu and Nevada.
I decided to put Ubuntu in an Extended partition with a 10 GB Logical Partition for the OS, /var and /home, and a 1 GB Logical partition for Swap.
For Solaris I allocated a 24 GB primary partition to become the ZFS root pool, which includes Swap, Dump, OS and Live-upgrade space.
The balance of the 100 GB disk will be shared between Ubuntu and Solaris using ZFS.


Note: Linux and Solaris have somewhat different views on how disk partitioning works.
For historical reasons, in particular compatibility with Solaris on SPARC hardware, the Solaris slices all live inside a single primary fdisk partition with an identifier of 0x82 (SOLARIS) or 0xbf (SOLARIS2), somewhat like how logical fdisk partitions live inside an "extended partition".
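
As an illustration, the final layout on the 100 GB disk could look something like this in fdisk terms (the device names and exact sizes are my own guesses based on the description above, not the actual layout; only the Solaris identifier really matters, and the type of the shared partition is unimportant since ZFS takes it over in step 9):

/dev/sda1    24 GB   bf   Solaris2          (ZFS root pool: OS, swap, dump, Live Upgrade space)
/dev/sda2    65 GB   0c   W95 FAT32 (LBA)   (shared ZFS pool, created in step 9)
/dev/sda3    11 GB   05   Extended
/dev/sda5    10 GB   83   Linux             (Ubuntu /, /var and /home)
/dev/sda6     1 GB   82   Linux swap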


2. Install Ubuntu first, creating only the partitions for it. Remember not to have any external drives connected, as they can screw up the order in which drives are detected and as a result bugger up the Grub menu list.


During the installation you create an admin user. This will eventually become a "backup" admin user.


3. Reboot and load patches/updates, and back up the Grub /boot/grub/menu.lst file to external media such as a USB thumb drive for easy access. The Ubuntu Grub does not understand ZFS, so you need to use Nevada's Grub to manage the multi-booting.


4. Also set Ubuntu to use the hardware clock as local time instead of UTC (this is what Solaris uses). To do this, change UTC=yes to UTC=no in /etc/default/rcS, then reboot.
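
If you prefer to make that change from the command line, something like this should do it (double-check /etc/default/rcS afterwards to confirm the line was actually changed):

sudo sed -i 's/^UTC=yes/UTC=no/' /etc/default/rcS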


5. Install Nevada. Use either of the Interactive Text installer options, but for simplicity's sake specify the system as non-networked.


6. Reboot and create a user for every-day use, and give this user the "Primary Administrator" profile using usermod -P "Primary Administrator" <userid>


7. Add the Ubuntu Grub entries you saved in step 3 to the end of the Nevada grub menu.lst file. With a ZFS root this is stored in /<poolname>/boot/grub/menu.lst (the default pool name is rpool)


8. Reboot back into Ubuntu, then follow the Linux ZFS-FUSE installation instructions to get ZFS-FUSE installed. I used the trunk to get the latest ZFS updates from Opensolaris.org included. Also see this Ubuntu Wiki page, and Ralf Hildebrand's blog for more info.


For reference, this is the procedure I used

sudo apt-get install mercurial build-essential scons libfuse-dev libaio-dev devscripts zlib1g-dev
cd ~
hg clone http://www.wizy.org/mercurial/zfs-fuse/trunk
cd trunk/src
scons
sudo scons install


9. Create an fdisk partition for the shared ZFS pool using the remaining disk space. I used a primary partition and set the identifier to W95 FAT32, though this is probably unimportant.


10. While still running Ubuntu, create a ZFS pool on this new fdisk partition using commands like these (substitute the device name of the partition you created in step 9; /dev/sda2 here is just an example):

sudo /usr/local/sbin/zfs-fuse
sudo /usr/local/sbin/zpool create -m /shared SHARED /dev/sda2


I like to give my ZFS pools names in all-capitals, purely because it makes the ZFS pool devices stand out better in the output from df and mount.


WARNING: I found that if I created the ZFS pool under Solaris, it refused to import into Ubuntu, but if I created it under Linux it imports/exports just fine in both directions. Both pools are created as version 10 pools, so the reason for this is not obvious. If you do decide to experiment with creating the pool under Solaris, you will discover that when you want to really get rid of the pool, you need to dd zeros over the partition before creating the pool again - otherwise the condition remains unchanged despite destroying and re-creating the pool. If you do experiment with this, please do share your results!


11. Export the ZFS pool using

/usr/local/sbin/zpool export SHARED


12. Reboot into Nevada and import the pool using

/usr/local/sbin/zpool import SHARED


Note: If you forget to export before you shut down, you will need to add -f to force the import after booting into the other OS.
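
That is, something like:

/usr/local/sbin/zpool import -f SHARED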


At this point I just sat there and stared in wonder at how well it actually works. There is beauty in finally seeing this working!


13. Create some init.d / rc scripts to automate the import/export on shutdown/startup.
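
A minimal sketch of what such a script could look like on the Ubuntu side is shown below. The script name and paths are assumptions based on where "scons install" put the binaries for me; on the Solaris side an equivalent rc script or SMF service can run the same zpool import/export.

#!/bin/sh
# /etc/init.d/zfs-shared - hypothetical example script for the shared pool
case "$1" in
start)
        /usr/local/sbin/zfs-fuse                # start the ZFS-FUSE daemon
        sleep 2                                 # give the daemon a moment to come up
        /usr/local/sbin/zpool import SHARED     # bring in the shared pool
        ;;
stop)
        /usr/local/sbin/zpool export SHARED     # release the pool cleanly for the other OS
        ;;
*)
        echo "Usage: $0 {start|stop}"
        ;;
esac
exit 0

Register it with something like sudo update-rc.d zfs-shared defaults, and it will be run with "start" at boot and "stop" at shutdown.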


14. Now you can start customizing both operating environments. You may want to set up automatic network configuration by enabling the NWAM SMF service in Solaris, e.g. by doing:

pfexec svcadm disable physical:default
pfexec svcadm enable physical:nwam


I'm looking forward to testing Live Upgrade on my setup with ZFS root, and to getting a shared home directory working well for both Solaris and Ubuntu. I created a login ID with the same gid/uid and a home directory under the shared ZFS pool, but after a few changes it broke under Ubuntu, probably due to subtle differences in how Gnome/desktop config items are stored and/or expected.


Despite my initial scepticism about FUSE, it is actually quite functional. All in all, I love being able to share a file system - well, many file systems - between the two operating environments!