
Thursday, February 17, 2011

Live Upgrade to install the recommended patch cluster on a ZFS snapshot

Live Upgrade used to require that you find some free slices (partitions) and then fidget with the -R "alternate Root" options to install the patch cluster to an ABE. With ZFS all of those pains have just ... gone away ...

Nowadays Live Upgrade on ZFS doesn't even copy the installation; instead it automatically clones a snapshot of the boot environment, saving much time and disk space! Even the patch install script is geared towards patching an Alternate Boot Environment!

The patching process involves six steps:

  1. Apply Pre-requisite patches
  2. Create an Alternate Boot Environment
  3. Apply the patch cluster to this ABE
  4. Activate the ABE
  5. Reboot
  6. Cleanup

Note: The system remains online throughout all except the reboot step.

In preparation you uncompress the downloaded patch cluster file. I created a ZFS file system, mounted it on /patches, and extracted the cluster there. Furthermore, you have to read the cluster README file - it contains a "password" needed to install, and information about pre-requisites and gotchas. Read the file. This is your job!
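As a sketch, the preparation might look like this (the zip file name and the rpool/patches dataset are assumptions; adjust to your environment):

# zfs create -o mountpoint=/patches rpool/patches
# cd /patches
# unzip /path/to/10_x86_Recommended.zip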

The pre-requisites are essentially just patches to the patch-add tools, conveniently included in the Patch Cluster!

Step 1 - Install the pre-requisites for applying the cluster to the ABE

# cd /patches/10_x86_Recommended
# ./installcluster --apply-prereq

Note - If you get an Error due to insufficient space in /var/run, see my previous blog post here!

Step 2 - Create an Alternate boot environment (ABE)

# lucreate -c s10u9 -n s10u9patched -p rpool

Checking GRUB menu...
Analyzing system configuration.
No name for current boot environment.
Current boot environment is named <s10u9>.
Creating initial configuration for primary boot environment <s10u9>.
The device </dev/dsk/c1t0d0s0> is not a root device for any boot environment; cannot get BE ID.
PBE configuration successful: PBE name <s10u9> PBE Boot Device </dev/dsk/c1t0d0s0>.
Comparing source boot environment <s10u9> file systems with the file
system(s) you specified for the new boot environment. Determining which
file systems should be in the new boot environment.
Updating boot environment description database on all BEs.
Updating system configuration files.
Creating configuration for boot environment <s10u9patched>.
Source boot environment is <s10u9>.
Creating boot environment <s10u9patched>.
Cloning file systems from boot environment <s10u9> to create boot environment <s10u9patched>.
Creating snapshot for <rpool/ROOT/s10_0910> on <rpool/ROOT/s10_0910@s10u9patched>.
Creating clone for <rpool/ROOT/s10_0910@s10u9patched> on <rpool/ROOT/s10u9patched>.
Setting canmount=noauto for </> in zone <global> on <rpool/ROOT/s10u9patched>.
Saving existing file </boot/grub/menu.lst> in top level dataset for BE <s10u9patched> as <mount-point>//boot/grub/menu.lst.prev.
File </boot/grub/menu.lst> propagation successful
Copied GRUB menu from PBE to ABE
No entry for BE <s10u9patched> in GRUB menu
Population of boot environment <s10u9patched> successful.
Creation of boot environment <s10u9patched> successful.

There is now an extra boot environment to which we can apply the Patch Cluster. Note - for what it is worth, if you just need a test environment to play in, you can now luactivate the alternate boot environment and then make any changes to the active system. If the system breaks, all it takes to undo any and all changes is a reboot.

Step 3 - Apply the patch cluster to the BE named s10u9patched.

# cd /patches/10_x86_Recommended
# ./installcluster -B s10u9patched

I am not showing the long and boring output from the installcluster script as this blog post is already far too long. The patching runs for quite a while; plan for at least two hours. Monitor the process and check the log for warnings. Depending on how long it has been since patches were last applied, some severe patches may be applied which can affect your ability to log in after rebooting. Again: READ the README!

Step 4 - Activate the ABE.

# luactivate s10u9patched
System has findroot enabled GRUB
Generating boot-sign, partition and slice information for PBE <s10u9>
A Live Upgrade Sync operation will be performed on startup of boot environment <s10u9patched>.

Generating boot-sign for ABE <s10u9patched>
Generating partition and slice information for ABE <s10u9patched>
Copied boot menu from top level dataset.
Generating multiboot menu entries for PBE.
Generating multiboot menu entries for ABE.
Disabling splashimage
Re-enabling splashimage
No more bootadm entries. Deletion of bootadm entries is complete.
GRUB menu default setting is unaffected
Done eliding bootadm entries.

**********************************************************************

The target boot environment has been activated. It will be used when you
reboot. NOTE: You MUST NOT USE the reboot, halt, or uadmin commands. You
MUST USE either the init or the shutdown command when you reboot. If you
do not use either init or shutdown, the system will not boot using the
target BE.

**********************************************************************

In case of a failure while booting to the target BE, the following process
needs to be followed to fallback to the currently working boot environment:

1. Boot from the Solaris failsafe or boot in Single User mode from Solaris
Install CD or Network.

2. Mount the Parent boot environment root slice to some directory (like
/mnt). You can use the following commands in sequence to mount the BE:

     zpool import rpool
     zfs inherit -r mountpoint rpool/ROOT/s10_0910
     zfs set mountpoint=<mountpointName> rpool/ROOT/s10_0910
     zfs mount rpool/ROOT/s10_0910

3. Run <luactivate> utility with out any arguments from the Parent boot
environment root slice, as shown below:

     <mountpointName>/sbin/luactivate

4. luactivate, activates the previous working boot environment and
indicates the result.

5. Exit Single User mode and reboot the machine.

**********************************************************************

Modifying boot archive service
Propagating findroot GRUB for menu conversion.
File </etc/lu/installgrub.findroot> propagation successful
File </etc/lu/stage1.findroot> propagation successful
File </etc/lu/stage2.findroot> propagation successful
File </etc/lu/GRUB_capability> propagation successful
Deleting stale GRUB loader from all BEs.
File </etc/lu/installgrub.latest> deletion successful
File </etc/lu/stage1.latest> deletion successful
File </etc/lu/stage2.latest> deletion successful
Activation of boot environment <s10u9patched> successful.

# lustatus
Boot Environment           Is       Active Active    Can    Copy
Name                       Complete Now    On Reboot Delete Status
-------------------------- -------- ------ --------- ------ ----------
s10u9                      yes      no     no        yes    -
s10u9patched               yes      yes    yes       no     -

Carefully take note of the details on how to recover from a failure. Making a hard copy of this is not a bad idea! Note that you have to use either init or shutdown to effect the reboot, as the other commands will circumvent some of the delayed-action scripts! Hence ...

Step 5 - Reboot using shutdown or init ...

# init 6

Monitor the boot-up sequence. A few handy commands while you are performing the upgrade, includes:

# lustatus
# bootadm list-menu
# zfs list -t all

You will eventually (after confirming that everything works as expected) want to free up the disk space held by the snapshots. The first command below cleans up the redundant Live Upgrade entries as well as the associated ZFS snapshot storage; the second removes the temporary ZFS file system used for the patching.

Step 6 - Cleanup

# ludelete s10u9
# zfs destroy rpool/patches

Again, no worries about where the space comes from: ZFS simply manages it! Live Upgrade takes care of your GRUB boot menu and gives you clear instructions on how to recover if anything goes wrong.

Adding a ZFS zvol for extra swap space

ZFS sometimes truly takes the thinking out of allocating and managing space on your file systems. But only sometimes.

Many operations on Solaris, OpenSolaris and Indiana will cause you to run into swap space issues. For example using the new Solaris 10 VirtualBox appliance, you will get the following message when you try to install the Recommended Patch Cluster:

Insufficient space available in /var/run to complete installation of this patch
set. On supported configurations, /var/run is a tmpfs filesystem resident in
swap. Additional free swap is required to proceed applying further patches. To
increase the available free swap, either add new storage resources to swap
pool, or reboot the system. This script may then be rerun to continue
installation of the patch set.

This is fixed easily enough by adding more swap space, like this:

# zfs create -V 1GB -b $(pagesize) rpool/swap2
# zfs set refreservation=1GB rpool/swap2
# swap -a /dev/zvol/dsk/rpool/swap2
# swap -l
swapfile             dev  swaplo blocks   free
/dev/zvol/dsk/rpool/swap 181,2       8 1048568 1048568
/dev/zvol/dsk/rpool/swap2 181,1       8 2097144 2097144

Setting the reservation is important, particularly if you plan on making the change permanent, e.g. by adding the new zvol as a swap entry in /etc/vfstab. Without it ZFS does not reserve the space for swapping, and the swap system may believe it has space which isn't actually available.
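For reference, making the extra swap device permanent amounts to one line in /etc/vfstab; a sketch of such an entry (fields: device to mount, device to fsck, mount point, FS type, fsck pass, mount at boot, options):

/dev/zvol/dsk/rpool/swap2   -   -   swap   -   no   -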

The -b option sets the volblocksize to improve swap performance by aligning the volume's I/O units on disk with the host architecture's memory page size (4 KB on x86 systems and 8 KB on SPARC, as reported by the pagesize command).
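A quick way to double-check the alignment afterwards (the output shown is illustrative, from an x86 system):

# pagesize
4096
# zfs get volblocksize rpool/swap2
NAME         PROPERTY      VALUE  SOURCE
rpool/swap2  volblocksize  4K     -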

If this is just temporary, then cleaning up afterwards is just as easy:

# swap -d /dev/zvol/dsk/rpool/swap2
# zfs destroy rpool/swap2

It is also possible to grow the existing swap volume. To do so, set a new size and refreservation for the existing volume like this:

# swap -d /dev/zvol/dsk/rpool/swap
# zfs set volsize=2g rpool/swap
# zfs set refreservation=2g rpool/swap
# swap -a /dev/zvol/dsk/rpool/swap

And finally, it is possible to do the above without unmounting/remounting the swap device, by using the following "trick":

# zfs set volsize=2g rpool/swap
# zfs set refreservation=2g rpool/swap
# swap -l | awk '/rpool.swap/ {print $3+$4}'|read OFFSET
# env NOINUSE_CHECK=1 swap -a /dev/zvol/dsk/rpool/swap $OFFSET

The above calculates the offset into the swap device and adds a new "device" to the list of swap devices, which automatically makes use of the added space in the zvol. The offset shows up as the "swaplo" value in the swap -l output. Multiple swap devices on the same physical media are not ideal, but on the next reboot (or after deleting and re-adding the swap device) the system will recognise the full size of the volume.

No worries about where the space comes from. ZFS just allocates it! The flip side of the coin is that once you have quotas, reservations, allocations, indirect allocations such as from snapshots, figuring out where your space has gone can become quite tricky! I'll blog about this some time!

Monday, December 6, 2010

Useless Performance Comparisons

The point of performance comparisons or benchmark articles has to be purely sensational. By far most of these appear to have little value other than attracting less educated readers to the relevant websites.

In a recent article Michael Larabel of Phoronix reports on the relative performance of various file systems under Linux, specifically comparing the traditional Linux file systems to the new (not yet quite available) native ZFS module. According to the article ZFS performs slower than the other Linux file systems in most of the tests, but I have a number of issues with both how the testing was done and with how the article was written.

Solaris 11 Express should have been included in the test, and the results for OpenIndiana should be shown for all tests. It is also crucial that the report include other system metrics, such as CPU utilization, during the test runs.

I also have some even more serious gripes. In particular, the blanket statement that some unspecified "subset" of the tests was performed on both a standard SATA hard drive and the SSD, and that the results were "proportionally" the same, does not make sense: some tests are more sensitive to seek latency than others, and some file systems hide these latencies better than others.

Another serious gripe is that there is no feature comparison. More complex software has more work to do, and one would expect some trade-offs.

Even worse: two of ZFS's strengths were eliminated by the way the testing was done. Firstly, when ZFS is given a "whole disk" as recommended in the ZFS best practices (as opposed to being given just a partition), it will safely enable the disk's write cache. It only does this when it knows there are no other file systems on the disk, i.e. when ZFS is in control of the whole disk. Secondly, ZFS manages many disks very efficiently, particularly where space allocation is concerned: ZFS performance doesn't come into its own on a single-disk system!

Importantly, and especially so since this is very much a beta version of a port of a mature and stable product, we need to understand which of ZFS's features are present, different and/or missing compared to the mature product. For example, one of ZFS's biggest performance inhibitors under FUSE is that it is limited to a single-threaded ioctl (Ed: Apparently this is fixed in ZFS for Linux 0.6.9, but I am unable to tell whether this is the version Phoronix tested), along with not having low-level access to the disk devices. The KQ Infotech website lists some missing features; particularly interesting is the missing Linux async I/O support. Furthermore, the KQ Infotech FAQ states that Direct IO falls back to buffered read/write functions and that missing kernel APIs are being emulated through the "Solaris Porting Layer".

A quick search highlights some serious known issues, such as the Linux VFS cache and ZFS ARC cache copy duplication bug, which heavily impacts performance.

More information about missing features can be found on the LLNL issue tracker page.

If nothing else, the article should mention the fact that there are known severe performance issues and feature incompleteness with the Linux native ZFS module! The way in which Linux allocates and manages virtual address space is inefficient (don't take my word for it, see this and this), requiring expensive workarounds.

Besides all of this, my real, main gripe is about this kind of article in general. The common practice of testing everything with "default installation settings" implies that nothing else needs to be done - yet when you want the absolute best possible performance out of something, you need to tune it for the specific workload and conditions. In the case of the article in question, the statement reads "All file-systems were tested with their default mount options", and no other information is given, such as whether the disk was partitioned, whether the different subject file systems were mounted at the same time, which disk the system was booted from, and whether the operating system was running with part of the disk hosting the tested file system mounted as its root. We don't even know whether the author read the ZFS Best Practices Guide.

It can be argued that the average person will not tune the system, or in this case the file system, for one specific workload because their workstation should be an all-round performer, but you should still comply with the best-practice recommendations from the vendors, especially if performance is one of your main criteria.

I don't know whether using defaults is ever acceptable in this kind of article. My issue stems from how these articles are written in a way that suggests that performance is the only, or at least the most important, factor in choosing an operating system, or file system, or graphics card, or CPU, or whatever the subject is. If that were true, then at the very least the system should be tuned to make the most of each of the subject candidates, whether these are hardware or software parts being tested and compared to one another. This tuning is often done by disabling features and configuring the relevant options, and usually, to get it right, you would need a performance expert on that piece of software or hardware to optimize it for each test.

Specific hardware (or software) often favors one or the other of the entrants. An optimized, feature-poor system will outperform a complex, feature-rich system on limited hardware. Making the best use of the available hardware might mean different implementation choices when optimizing for performance rather than for functionality or reliability. ZFS in particular comes into its own, both in terms of features and performance, when it has a lot of hardware underneath it: RAM, disk controllers, and as many dedicated disk drives as possible. The other file systems have likely reached their performance limit on the limited hardware on which the testing was done. Linux is particularly aimed at the non-NUMA, single-core, single-hard-drive, single-user environment. Solaris, and ZFS, were developed in a company where single-user workstations were an almost non-existent target, the real target of course being the large servers of the big corporates.

As documented in the ZFS Evil Tuning Guide, many tuning options exist. One could turn off ZFS check-sum calculations, limit the ARC cache sizes, set the SSD disk as a cache or log device for the SATA disk, and set the pool to cache only meta data, to mention a few. Looking at the hardware available in the Phoronix article, the choices would depend on the specific test – in another test one might stripe between the SATA disk and the SSD disk, in another you might choose to mirror across the two.

The other file system candidates might have different recommendations in terms of how to optimize for performance.

I realize that the functionality would be affected by such tuning, but the article doesn’t look at functionality, or even usability for that matter. ZFS provides good reliability and data integrity, but only in its default configuration, with data check-summing turned on. The data protection levels and usable space in each test might be different, but that again is a function of which features are used and not the subject of the article, not even mentioned anywhere in the article.

As a case in point for the argument about functionality, one needs to consider all that ZFS is doing in addition to being a POSIX-compliant file system. It replaces the volume manager. It adds data integrity checking through check-summing. It manages space allocation automatically, including space for file systems, meta-data, snapshots, and ZVOLs (virtual devices created out of a ZFS pool). Usage can be controlled by means of a complete set of reservation and quota options. Changing settings, such as turning on encryption, the number of copies of data to be kept, whether to do check-summing, etc., is dynamic. There is much more, as Google will tell.
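For example, most of these settings are just properties that can be changed on the fly on a live dataset (the pool and dataset names here are made up for illustration):

# zfs set checksum=sha256 tank/data
# zfs set copies=2 tank/data
# zfs set quota=100G tank/data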

And just to add insult to injury, the article goes and pits XFS against ZFS, ignoring the many severe reliability issues present with XFS, such as the often reported data corruption under heavy load and severe file system corruption when losing power.

I would really like to see a performance competition one day. The details of how the testing will be done are given out in advance to allow the teams to research it. Each team is then given the same budget from which to buy and build their own system to enter into the competition. Their performance experts then set up and build the systems, install the software, and tune it for the tests on the specific hardware they have available. One team might buy a system with more CPUs while another might buy a system with more disks and SCSI controllers, but the test is fair (barring my observation about how feature-poor systems will always perform better on a low-budget system) because the teams each solve the same problem with the same budget. The teams submit their finished systems to the judges, who run the performance test scripts, and publish their configuration details in a how-to guide. To eliminate cheating, an independent group will try to duplicate each team's test results using the guide.

I think this would make a fun event for a LAN party – any sponsors interested?

Friday, November 7, 2008

Neat way to prevent multiple instances of a script

Sometimes you need to ensure that you can never have more than one instance of a script running at the same time. This is especially important with scripts that modify files, and it becomes more critical if the script runs for longer than a fraction of a second. Also, if many people administer a system, it becomes more important to ensure they don't step on each other's toes.

Basically the problem is known as the multiple writers problem, and it is solved by something called "semaphores". Semaphores are a feature implemented in the kernel with the purpose of providing a way to guarantee that a piece of code is made "mutually exclusive". I don't want to go into the details here as they are already properly explained on many websites, but have a look at the Wikipedia article if you are interested in the topic.

One technique to get around the problem of multiple instances of a script is to use the existence of a specific file somewhere to signal to other instances of the script that there is already a running instance. The "touch" command does not complain about existing files, so you need to check first whether the file exists already, exit if it does, and create it otherwise. For example:

if [ -f /tmp/already_running ] 
then
   echo Can not continue - the lock file already exists!
   echo If you are sure that no other instance of this script
   echo is running, delete the file /tmp/already_running and try again.
   exit 1
else
   touch /tmp/already_running
fi
....

Near the end of this script it is common to delete the file for the next use. This, however, is not the best solution.

If two instances of the script were started at nearly the same instant, what could happen is that instance 1 checks the file, finds it does not exist, but then gets kicked off the CPU so that instance 2 can run. Instance 2 then checks the lock file and also sees it is OK to continue, and then creates the lock. Instance 1 eventually gets CPU time again and, having already previously checked the lock file, believes it is safe to continue running.

This is known as a race condition, and is by definition what Semaphores are meant to prevent. But semaphores are not easily accessible in scripts, or so it might seem.

Now I know you are asking "but what is the chance of such precise timing of scheduled CPU time causing this kind of race condition?". Yes, the chances are probably low, especially if you are the only person using a specific script. However, there is a proper way of checking that we are the only instance running, and it is even simpler to implement than the check-file-touch-file method!

The secret lies in the mkdir command.

if ! mkdir /tmp/already_running
then
   echo Can not continue - the lock file already exists!
   echo If you are sure that no other instance of this script
   echo is running, delete the file /tmp/already_running and try again.
   exit 1
fi
....

mkdir will automatically use the kernel's built-in semaphores during the actual process of creating the entry in the file system. It will fail if the directory already exists, and on successful return the lock will already be in place, so no extra commands are needed to complete the mutex locking process.
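You can see this behaviour directly on the command line (illustrative; note the non-zero exit status on the second attempt):

$ mkdir /tmp/already_running ; echo $?
0
$ mkdir /tmp/already_running ; echo $?
mkdir: Failed to make directory "/tmp/already_running"; File exists
1
$ rmdir /tmp/already_running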

The second part of this is to automate the release of the lock when the script exits. Typically you want the lock to be released whether someone kills the script, presses Ctrl-C, or the script terminates normally or on an error.

This is done by means of an EXIT trap, and the format for using traps in Bourne shell variants is:

trap "do-something-here" EXIT

This trap must be set AFTER obtaining the lock, otherwise a second instance of the script will "inadvertently" remove the lock obtained by the first instance (because the new instance would, on exiting after failing to obtain the lock, remove a lock it never held).

You obviously don't have to have an if-then-fi to print a message to the user - if you are the only person using a script, you can simplify the checking of the lock as follows:

MUTEX_LOCK=/tmp/myscript_already_running
mkdir $MUTEX_LOCK || exit 1
trap "rmdir $MUTEX_LOCK" EXIT

With the above you will simply get an error message from mkdir, which you need to interpret as "the script is already running", e.g.:

mkdir: Failed to make directory "/tmp/myscript_already_running"; File exists

Using this technique, a whole script might look like this:

#!/bin/ksh
# This is myscript v1.0.

#Set up the running environment
MUTEX_LOCK=/tmp/myscript_already_running
...

if ! mkdir $MUTEX_LOCK
then
   echo Can not continue - the lock file already exists!
   echo If you are sure that no other instance of this script
   echo is running, delete the file $MUTEX_LOCK and try again.
   exit 1
fi
trap "rmdir $MUTEX_LOCK" EXIT
...

# Doing work which requires only one instance of the script to be running
...

# THE END

Note there is no "remove lock" statement at the end of the script. This is handled by the trap, which executes on any exit, except of course a kill -9.

A kill -9 should in any case only ever be used as a last resort, because it does not allow the program to clean up after itself.

Saturday, October 18, 2008

Making the most of Solaris Man pages

Solaris man pages (manual pages) are well written, consistent, complete, and generally a great source of information. Here are a few tips to help you get the most out of them. Of course this applies to all Solaris derivatives, including proper Solaris, OpenSolaris, and OpenSolaris-based distributions like Solaris Express, Belenix and Nexenta. Much of it even applies to other Unix flavors like BSD and AIX, and Unix derivatives like Linux.

The simplest (and most obvious) way to use the manual pages is to enter "man command" and then read the entire page.


My first hint however is to change the "PAGER" environment variable. I suggest you set it like this in your .profile file:

PAGER="less -iMsq"; export PAGER


(or for csh and its family members, use setenv PAGER "less -iMsq")


The reasons for this are

a) "less" supports more useful options than does "more". In particular, it supports highlighting, as well as all of the below!

b) The "-i" causes less to ignore case in searches. A definite advantage because you can simply enter /user and it will find the string "user" even when capitalized for use at the start of a sentence.

c) The -M causes "less" to show a more verbose status at the bottom of the screen.

d) -q ... Stop irritating me.

e) -s to "squeeze" blank lines (because otherwise less will "format" pages with blank spaces, as if it is a printed page)

f) You can return to the first line (top) of the page by typing "1" followed by "G"

g) less is more.


The PAGER environment variable is automatically used by the man command, so you can now test it and will find that you can scroll up and down in man pages. To find a word in the man page, tap the forward-slash / followed by the string to search for, and the word wherever it is found will be highlighted in inverse text.


Note: less does have some limitations compared to more, but these do not apply to viewing man pages. For the curious, the limitation is in the handling of control characters in the input file. In particular, more does a better job of displaying "captured" sessions recorded from a system console or from a shell using the "script" command. Regardless, I get around these by reading capture files using cat -nvet | less -iMq (Note - I like seeing line numbers)


Second Hint: Generate the Manual Pages keyword index database (The so-called windex or apropos information). To do this, you need to run the following command once.

$ catman -w


This will run for a while as it generates and stores the man page keyword index for future use.


Once it completes, the man command's -k option and the "whatis" command will work. This allows you to find what you need much easier. For example to find all man pages for tools, commands and drivers related to Wifi and Wireless networking, you can use

$ man -k wifi wireless


Third hint: Some man pages are not automatically searched or found. In older (proper) Solaris releases, the man command has a "default list" of locations in which to find man pages. In recent OpenSolaris releases, man constructs a list of locations to search. The man manual page states:


  Search Path
     Before searching for a given name, man constructs a list  of
     candidate directories and sections. man searches for name in
     the directories specified by the MANPATH  environment  vari-
     able.

     In the absence of MANPATH, man constructs  its  search  path
     based  upon the PATH environment variable, primarily by sub-
     stituting man for the last component of  the  PATH  element.
     Special  provisions  are added to account for unique charac-
     teristics  of   directories   such   as   /sbin,   /usr/ucb,
     /usr/xpg4/bin, and others. If the file argument contains a /
     character, the dirname portion of the argument  is  used  in
     place of PATH elements to construct the search path.


Note: The (currently apparently undocumented) -p option to man will show the effective man search path.


As implied, you can set in MANPATH a list of directories containing man pages you use regularly. It is also possible to override the man search path using the man command's -M option. This is particularly relevant for packages that install under /usr/local, /opt/SUNW...., etc. So to view the man page for "explorer" you would run:

man -M /opt/SUNWexplo/man explorer


(Assuming you have SUNWexplorer installed)
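If you use such locations regularly, you can instead add them to MANPATH in your .profile, for example (paths are illustrative; remember to include the standard locations, because once MANPATH is set it replaces the PATH-based search described above):

MANPATH=/usr/share/man:/usr/local/man:/opt/SUNWexplo/man
export MANPATH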


Finally, it helps to understand the structure of a man page. The individual manual pages are divided up into a common set of sections, and knowing them you will often skip straight to a specific (sub)section in the page when looking for specific information.


Some of these subsections are optional, but the names used are consistent. Here is a quick summary of some important sections.

NAME: What this page is about. This is the "whatis" information.
SYNOPSIS: The summary, e.g. how to use a command.
DESCRIPTION: A more complete description of what this component is.
OPTIONS, OPERANDS and USAGE: These sections detail how a command is used, e.g. what command-line options are available for a specific command.
ENVIRONMENT VARIABLES: As the name suggests, this section details environment variables which affect the behavior of the command.
EXIT STATUS: As the name implies, it explains the possible exit status values. Useful to know that the section exists.
FILES: A list of files which are relevant to the command or subsystem. In particular, here you will find out where a program keeps its configuration files. For a good example, see man dumpadm.
EXAMPLES: Often a great place to skip to when a manual page is complicated.
ATTRIBUTES: An often overlooked section, it explains the status of the item. For commands it tells you which packages the files are in. This section of man pages is fully explained in the "attributes" manual page.
SEE ALSO: One of the most useful parts of man pages, this gives you hints about other pages which are related to the same topic.
NOTES: Various additional notes.


Other sections, such as SECURITY, ERRORS, SUBCOMMANDS, etc exists in some man pages, but the above are probably the most useful sections.


The entire collection of man pages is divided up into manual sections, such as Section 1 - User Commands, Section 1M - Maintenance commands for sys-admins, Section 3C - the C programming reference information, and Section 7 - information about device drivers. This is not hugely important, except that you need to understand that some pages occur in more than one section, for example "read" or "signal". By default, man will display the first manual page which matches the name. If you want to see one of the other pages you need to specify the specific manual section explicitly.


For example

$ man -f signal

signal  signal (3c) - simplified signal management for application processes
signal  signal (3ucb) - simplified software signal facilities
signal  signal.h (3head)    - base signals


We can see that "signal" has manual entries in sections 3c (the C programming reference), 3ucb (the UCB/POSIX version), and 3head (the headers section).


Simply entering "man signal" will show the man page from "3C". To see the man page in the "3head" section, enter either

$ man -s 3head signal

or

$ man signal.3head


I often end up using the second form, simply because after viewing a man page I often see in the SEE ALSO section that there is another related man page. I then exit the man page, recall the previous command, and append .section to the command line - which is to say that it is just a matter of convenience.


One last tip: There are online versions of the manual pages at http://docs.sun.com.


You don't need to learn every command and every option by heart. Knowing that you have manual pages, that you can quickly look up related commands from SEE ALSO and from the keyword index using man -k, and that you can quickly search through a man page for the right section or keyword absolutely WILL make you a cleverer, more efficient and overall better sysadmin!

Tuesday, July 29, 2008

How Solaris disk device names work

Writing this turned out to be surprisingly difficult as the article kept on growing too long. I tried to be complete but also concise, to make this a useful introduction to how Solaris uses disks and how device naming works.

So to start at the start: Solaris builds a device tree which is persistent across reboots and even across configuration changes. Once a device is found at a specific position on the system bus, an entry is created for the device instance in the device tree, and an instance number is allocated. Predictably, the first instance of a device is zero (e.g. e1000g0), and subsequent instances of devices using the same driver get allocated instance numbers incrementally (e1000g1, e1000g2, etc). The allocated instance number is registered together with the device driver and the path to the physical device in the /etc/path_to_inst file.
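For illustration, an /etc/path_to_inst entry ties a physical device path to a driver and its instance number. A line for the sd disk used in the examples later in this post would look something like this:

"/pci@0,0/pci-ide@1f,2/ide@1/sd@0,0" 0 "sd"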

This specific feature of Solaris is very important in providing stable, predictable behavior across reboots and hardware changes. For disk controllers this is critical as system bootability depends on it!

With Linux, the first disk in the system is known as /dev/sda, even if it happens to be on the second controller, or has a target number other than zero on that controller. A new disk added on the first controller, or on the same controller but with a lower target number, causes the existing disk to move to /dev/sdb, and the new disk then becomes /dev/sda. This used to break systems, causing them to become non-bootable, and was a general headache. Some methods of dealing with this exist, using unique disk identifiers, device paths based on /dev/disk/by-path, etc.

If a Solaris system is initially configured with all disks attached to the second controller, the devices will get names starting with c1. Disks added to the first controller later on will get names starting with c0, and the existing disk device names will remain unaffected. If a new controller is added to the system, it will get a new instance number, e.g. c2, and existing disk device names will again remain unaffected.

Solaris, however, composes disk device names (device aliases) out of parts which identify the controller, the target-id, the LUN-id, and finally the slice or partition on the disk.

I will use some examples to explain this. Looking at this device:

$ ls -lL /dev/dsk/c1t*

br--------   1 root     sys       27, 16 Jun  2 16:26 /dev/dsk/c1t0d0p0
br--------   1 root     sys       27, 17 Jun  2 16:26 /dev/dsk/c1t0d0p1
br--------   1 root     sys       27, 18 Jun  2 16:26 /dev/dsk/c1t0d0p2
br--------   1 root     sys       27, 19 Jun  2 16:26 /dev/dsk/c1t0d0p3
br--------   1 root     sys       27, 20 Jun  2 16:26 /dev/dsk/c1t0d0p4
br--------   1 root     sys       27,  0 Jun  2 16:26 /dev/dsk/c1t0d0s0
br--------   1 root     sys       27,  1 Jun  2 16:26 /dev/dsk/c1t0d0s1
br--------   1 root     sys       27, 10 Jun  2 16:26 /dev/dsk/c1t0d0s10
br--------   1 root     sys       27, 11 Jun  2 16:26 /dev/dsk/c1t0d0s11
br--------   1 root     sys       27, 12 Jun  2 16:26 /dev/dsk/c1t0d0s12
br--------   1 root     sys       27, 13 Jun  2 16:26 /dev/dsk/c1t0d0s13
br--------   1 root     sys       27, 14 Jun  2 16:26 /dev/dsk/c1t0d0s14
br--------   1 root     sys       27, 15 Jun  2 16:26 /dev/dsk/c1t0d0s15
br--------   1 root     sys       27,  2 Jun  2 16:26 /dev/dsk/c1t0d0s2
br--------   1 root     sys       27,  3 Jun  2 16:26 /dev/dsk/c1t0d0s3
br--------   1 root     sys       27,  4 Jun  2 16:26 /dev/dsk/c1t0d0s4
br--------   1 root     sys       27,  5 Jun  2 16:26 /dev/dsk/c1t0d0s5
br--------   1 root     sys       27,  6 Jun  2 16:26 /dev/dsk/c1t0d0s6
br--------   1 root     sys       27,  7 Jun  2 16:26 /dev/dsk/c1t0d0s7
br--------   1 root     sys       27,  8 Jun  2 16:26 /dev/dsk/c1t0d0s8
br--------   1 root     sys       27,  9 Jun  2 16:26 /dev/dsk/c1t0d0s9

We notice the following:

1. The entries exist as links under /dev/dsk, pointing to the device node files in the /devices tree. Actually every device has a second entry under /dev/rdsk. The ones under /dev/dsk are "block" devices, used in a random-access manner, e.g. for mounting file systems. The "raw" device links under /dev/rdsk are character devices, used for low-level access functions (such as creating a new file system).

2. The device names all start with c1, indicating controller c1 - so basically all the entries above are on one controller.

3. The next part of the device name is the target-id, indicated by t0. This is determined by the SCSI target-id number set on the device, and not by the order in which disks are discovered. Any new disk added to this controller will have a new unique SCSI target number and so will not affect existing device names.

4. After the target number each disk has got a LUN-id number, in the example d0. This too is determined by the SCSI LUN-id provided by the device. Normal disks on a simple SCSI card all show up as LUN-id 0, but devices like arrays or jbods can present multiple LUNs on a target. (In such devices the target usually indicates the port number on the enclosure)

5. Finally, each device name identifies a partition or slice on the disk. Names ending with a p# indicate a PC BIOS disk partition (sometimes called an fdisk or primary partition), and names ending with an s# indicate a Solaris slice.

This begs some more explaining. There are five device names ending with p0 through p4. The p0 device, e.g. c1t0d0p0, indicates the whole disk as seen by the BIOS. The c_t_d_p1 device is the first primary partition, c_t_d_p2 the second, etc. These devices represent all four of the allowable primary partitions, and always exist even when the partitions are not in use.

In addition there are 16 devices with names ending with s0 through s15. These are Solaris "disk slices", and originate from the way disks are "partitioned" on SPARC systems. Essentially Solaris uses slices much like PCs use partitions - most Solaris disk admin utilities work with disk slices, not with fdisk or BIOS partitions.

The way the "disk" is sliced is stored in the Solaris VTOC, which resides in the first sector of the "disk". In the case of x86 systems, the VTOC exists inside one of the primary partitions, and in fact most disk utilities treats the Solaris partition as the actual disk. Solaris splits up the particular partition into "slices", thus the afore mentioned "disk slices" really refers to slices existing in a partition.

Note that Solaris disk slices are often called disk partitions, so the two can easily be confused - when documentation refers to partitions you need to make sure you understand whether PC BIOS partitions or Solaris slices are implied. In general, if the documentation applies to SPARC hardware (as well as to x86 hardware), then partitions are Solaris slices (SPARC does not have an equivalent to the PC BIOS partition concept).

Example Disk Layout:

p1  - First primary partition
p2  - Second primary partition
p3  - Solaris partition (type 0xBF or 0x80), containing the Solaris slices:
      s0  - Slice commonly used for root
      s1  - Slice commonly used for swap
      s2  - Whole disk (backup or overlap slice)
      s3  - Custom use slice
      s4  - Custom use slice
      s5  - Custom use slice
      s6  - Custom use slice, commonly /export
      s7  - Custom use slice
      s8  - Boot block
      s9  - Alternates (2 cylinders)
      s10 - x86 custom use slice
      s11 - x86 custom use slice
      s12 - x86 custom use slice
      s13 - x86 custom use slice
      s14 - x86 custom use slice
      s15 - x86 custom use slice
p4  - Extended partition, containing:
      p5  - Example: Linux or data partition
      p6  - Example: Linux or data partition
      etc - Example: Linux or data partition

Note that traditionally slice 2 "overlaps" the whole disk, and is commonly referred to as the backup slice, or slightly less commonly, called the overlap slice.

The ability to have slice numbers from 8 to 15 is x86 specific. By default slice 8 covers the area on the disk where the label, VTOC and boot record are stored. Slice 9 covers the area where the "alternates" data is stored - a two-cylinder area used to record information about relocated/errored sectors.
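To see how this fits together on an actual disk, the prtvtoc utility (covered further down) prints the slice table. A sketch of typical x86 output, with illustrative numbers - note how slice 2 spans the whole disk while slice 8 covers just the first cylinder:

# prtvtoc /dev/rdsk/c0d0s2
* /dev/rdsk/c0d0s2 partition map
* (header comments trimmed)
*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
       0      2    00      16065  16755795  16771859   /
       2      5    01          0  16771860  16771859
       8      1    01          0     16065     16064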

Another example of disk device entries:

$ ls -lL /dev/dsk/c0*

brw-r-----   1 root     sys      102, 16 Jul 14 19:45 /dev/dsk/c0d0p0
brw-r-----   1 root     sys      102, 17 Jul 14 19:45 /dev/dsk/c0d0p1
brw-r-----   1 root     sys      102, 18 Jul 14 19:45 /dev/dsk/c0d0p2
brw-r-----   1 root     sys      102, 19 Jul 14 19:12 /dev/dsk/c0d0p3
brw-r-----   1 root     sys      102, 20 Jul 14 19:45 /dev/dsk/c0d0p4
brw-r-----   1 root     sys      102,  0 Jul 14 19:45 /dev/dsk/c0d0s0
brw-r-----   1 root     sys      102,  1 Jul 14 19:45 /dev/dsk/c0d0s1
...
brw-r-----   1 root     sys      102,  8 Jul 14 19:45 /dev/dsk/c0d0s8
brw-r-----   1 root     sys      102,  9 Jul 14 19:45 /dev/dsk/c0d0s9

The above example is taken from an x86 system. Note the lack of a target number in the device names; this is particular to ATA hard drives on x86 systems. Besides that, the names work like the normal device names described above.

Below, comparing the block and raw device entries:

$ ls -l /dev/*dsk/c1t0d0p0

lrwxrwxrwx   1 root     root          49 Jun 26 16:22 /dev/dsk/c1t0d0p0 -> ../../devices/pci@0,0/pci-ide@1f,2/ide@1/sd@0,0:q
lrwxrwxrwx   1 root     root          53 Jun  2 16:18 /dev/rdsk/c1t0d0p0 -> ../../devices/pci@0,0/pci-ide@1f,2/ide@1/sd@0,0:q,raw

These look the same, except that the second one points to the raw device node.

For completeness' sake, some utilities used in managing disks:

format - The work-horse, used to perform partitioning (including fdisk partitioning on x86 based systems), analyzing/testing the disk media for defects, tuning advanced SCSI parameters, and generally checking the status and health of disks.
rmformat - Shows information about removable devices, formats media, etc.
prtvtoc - Command-line utility to display information about disk geometry and, more importantly, the contents of the VTOC in a human readable format, showing the layout of the Solaris slices on the disk.
fmthard - Write or overwrite a VTOC on a disk. Its input format is compatible with the output produced by prtvtoc, so it is possible to copy the VTOC between two disks by means of a command like this:

prtvtoc /dev/rdsk/c1t0d0s2 | fmthard -s - /dev/rdsk/c1t1d0s2

This is obviously not meaningful if the second disk does not have enough space. If the disks are of different sizes, you can use something like this:

prtvtoc /dev/rdsk/c1t0d0s2 | awk '$1 != 2' | fmthard -s - /dev/rdsk/c1t1d0s2

The above awk command causes the entry for slice 2 to be omitted, and fmthard will then maintain the existing slice 2 entry on the target disk or, if none exists, create a default one.

Also note, as implied above, that Solaris slices can (and often do) overlap. Care needs to be taken not to have file systems on slices which overlap other slices.

iostat -En - Show "Error" information about disks and, often very useful, the firmware revisions and manufacturer's identifier strings.
format -e - Reveals [enhanced|advanced] functionality, such as the cache option on SCSI disks.
format -Mm - Enable debugging output; in particular, makes SCSI probe failures non-silent.

cfgadm and luxadm also deserve honorable mention here. These commands manage disk enclosures, detaching and attaching devices, etc., but are also used in managing some aspects of disks.

luxadm -e port - Show a list of FC HBAs.

luxadm can also for example be used to set the beacon LED on individual disks in FCAL enclosures that support this function. The details are somewhat specific to the relevant enclosure.

cfgadm can be used to probe SAN connected subsystems, eg by doing:

cfgadm -c configure c2::XXXXXXXXXXXX

(where XXXXXXXXXXX is the enclosure port WWN, using controller c2)

Hopefully this gives you an idea about how disk device names, controller names, and partitions and slices all relate to one another.

Wednesday, July 16, 2008

Reading and Writing ISO images using Solaris

After my recent post on mounting ISO image files I thought I should write a quick article on the other ways of using these files: reading a disk into a file and burning a file to a disk. This is not a complete guide to the topic by a long shot, but if you just want the quick-start answer, it is here.

If you have an iso9660 CD (or DVD) image file that you want to burn to a disk, you simply use this command:

# cdrw -i filename.iso

This will write the file named filename.iso to the default CD writer device. For DVD media the session is closed (using disk-at-once writing), while for CD media track-at-once writing is used.

To create an ISO image from a disk, use this command:

# readcd dev=/dev/rdsk/c1t0d0s2 f=filename.iso speed=1 retries=20

readcd needs at least the device and the file to be specified. To discover the device, you can use the command "iostat -En" and look for the Writer device, or you can let readcd scan for a device, using a command like this:

# readcd -scanbus

scsibus1:
 1,0,0 100) 'MATSHITA' 'DVD-RAM UJ-841S ' '1.40' Removable CD-ROM
 1,1,0 101) *
 1,2,0 102) *
 1,3,0 103) *
 1,4,0 104) *
 1,5,0 105) *
 1,6,0 106) *
 1,7,0 107) *

The device 1,0,0 can be used directly, or you can convert it to the Solaris naming convention as I did in the example above.
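For reference, the scanbus address can also be passed to readcd directly; using the device found above, the earlier command becomes:

# readcd dev=1,0,0 f=filename.iso speed=1 retries=20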

There are of course other ways of doing it; feel free to comment and tell me about your favourite method for reading to or burning from ISO image files.

Wednesday, July 9, 2008

A short guide to the Solaris Loop-back file systems and mounting ISO images

The Solaris Loop-back file system is a handy bit of software, allowing you to "mount" directories, files and, in particular, CD or DVD image files in ISO-9660 format.

To make it more user friendly, build 91 of ONV introduces the ability for the mount command to automatically create the loop-back devices for ISO images! The changelog for NV 91 has the following note:

Issues Resolved: PSARC case 2008/290 : lofi mount. BUG/RFE: 6384817 - Need persistent lofi based mounts and direct mount(1m) support for lofi

In older releases, it was necessary to run two commands to mount an ISO image file. The first to set up a virtual device for the ISO image:

# lofiadm -a /shared/Downloads/image.iso
/dev/lofi/1

And then to mount it somewhere:

# mount -F hsfs -o ro /dev/lofi/1 /mnt

Solaris uses hsfs to indicate the "High Sierra File System" driver used to mount ISO-9660 files. Specify "-o ro" to make it read-only, though that is the default for hsfs file systems, at least lately (I seem to recall that at one point in the past it was mandatory to specify read-only mounting explicitly).

Looking at what has been happening here, we can see the Loop-back device by running lofiadm without any options:

# lofiadm

Block Device             File                           Options
/dev/lofi/1              /shared/Downloads/image.iso -

And the mounted file system:

# df -k /mnt

Filesystem            kbytes    used   avail capacity  Mounted on
/dev/lofi/1          2915052 2915052       0   100%    /mnt

The new feature of the mount command requires a full path to the ISO file (just like lofiadm does, at any rate it does for now):

# mount -F hsfs -o ro /shared/Downloads/image2.iso /mnt

To check the status:

# df -k /mnt

Filesystem            kbytes    used   avail capacity  Mounted on
/shared/Downloads/image2.iso
                     7781882 7781882       0   100%    /mnt

And when we run lofiadm we see it automatically created a new device, /dev/lofi/2:

# lofiadm

Block Device             File                           Options
/dev/lofi/1              /shared/Downloads/image.iso -
/dev/lofi/2              /shared/Downloads/image2.iso -
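When you are done with an image, simply unmount it; if lofiadm still lists the device afterwards, remove it explicitly (the device number here matches the listing above):

# umount /mnt
# lofiadm -d /dev/lofi/2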

Some of the other uses of the Loop-back file system:

You can mount any directory on any other directory:

# mkdir /mnt1
# mount -F lofs -o ro /usr/spool/print /mnt1

Note the use of lofs as the file system "type". This is a bit like a hard-link to a directory, and it can exist across file systems. These can be read-write or read-only.

You can also mount any individual file onto another file:

# mkdir /tmp/mnt
# echo foobar > /tmp/mnt/X
# mount -F lofs /usr/bin/ls /tmp/mnt/X
# ls -l /tmp/mnt

total 67
-r-xr-xr-x   1 root     bin        33396 Jun 16 05:43 X
# cd /tmp/mnt
# ./X
X
# ./X -l
total 67
-r-xr-xr-x   1 root     bin        33396 Jun 16 05:43 X

The above feature incidentally inspired item nr 10 on my ZFS feature wish list.

This allows for a lot of flexibility. Indeed, this functionality is central to how file systems and disk space are provisioned in Solaris Zones. If you play around with it you will find plenty of uses for it!
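As a sketch of that zones connection, this is roughly how a global-zone directory gets loop-back mounted into a zone with zonecfg (the zone name and paths are made up for illustration):

# zonecfg -z myzone
zonecfg:myzone> add fs
zonecfg:myzone:fs> set dir=/opt/app
zonecfg:myzone:fs> set special=/export/app
zonecfg:myzone:fs> set type=lofs
zonecfg:myzone:fs> end
zonecfg:myzone> commit
zonecfg:myzone> exit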





Sunday, July 6, 2008

ZFS missing features

What would truly make ZFS be The Last Word in File Systems (PDF)?

Why every feature of course! Here is my wishlist!

  1. Nested vdevs (e.g. RAID 1+Z)
  2. Hierarchical storage management (migrate rarely used files to cheaper/slower vdevs)
  3. Traditional Unix quotas (i.e. for when you have multiple users owning files in the same directories spread out across a file system)
  4. A way to convert a directory on a ZFS file system into a new ZFS file system, and the corresponding reverse function to merge a directory back into its parent (because the admin made some wrong decision)
  5. A backup function supporting partial restores. In fact partial backups should be possible too, e.g. backing up any directory or file list, not necessarily only at the file system level. And restores which do not require the file system to be unmounted/re-mounted.
  6. Re-layout of pools (to accommodate adding disks to a raidz, converting a non-redundant pool to raidz, removing disks from a pool, etc.) (Yes, I'm aware of some work in this regard)
  7. Built-in multi-pathing capabilities (with automatic/intelligent detection of active paths to devices), e.g. integrated MPxIO functionality. I'm guessing this is not there yet because people may want to use MPxIO for other devices not under ZFS control, and this would create situations where there are redundant layers of multipathing logic.
  8. True global file system functionality (multiple hosts accessing the same LUNs and mounting the same file systems with parallel write). Or even just a sharezfs (like sharenfs, but allowing the client to access ZFS features, e.g. to set ZFS properties, create datasets, snapshots, etc., similar in functionality to what is possible when granting a zone ownership of a ZFS dataset).
  9. While we're at it: in-place conversion from, e.g., UFS to ZFS.
  10. The ability to snapshot a single file in a ZFS file system (so that you can do per-file version tracking)
  11. An option on the zpool create command to take a list of disks and automatically set up a layout, intelligently taking into consideration the number of disks and the number of controllers, allowing the user to select from a set of profiles determining optimization for performance, space or redundancy.

So... what would it take to see ZFS as the new default file system on, for example USB thumb drives, memory cards for digital cameras and cell phones, etc? In fact, can't we use ZFS for RAM management too (snapshot system memory)?




Saturday, July 5, 2008

The Pupil will surpass the tutor

Linux is an attempt at making a free clone of Unix. Initially it aimed to be Unix compatible, though I feel that goal has become less and less important as Linux grew in maturity.

Now all of a sudden we have a complete turn-about as the big Unices want to be like Linux! Linux is attractive for a variety of reasons, including a fast, well refined kernel, lots of readily available and free applications, good support and, because of these, a growing and loyal following. The utilities available with most Linux distributions are based on the core utilities found in the big Unices, plus a large collection of new additions, all working together in a more or less coherent way to build a usable platform.

Nowadays many new Unix administrators have at least some Linux experience, and with this background they can easily be frustrated when looking for Linux-specific utilities (where is top in Solaris?). End users would like to see the applications they used on Linux run on Unix. And the ability to run Linux on cheap and cheerful PC hardware does not detract from Linux's popularity by any means.

So Sun Microsystems, just like IBM with AIX, finds itself looking at Linux to see what this platform is doing right to make it successful. To me this, more than anything else, is proof that Linux has finally grown up.

I expect that a leap-frog game will emerge between Linux and Unix, particularly Solaris, with the two competing with innovative features to be the platform of choice for both datacenter and desktop applications.

Congratulations to the Linux community on a job well done.





Tuesday, July 1, 2008

Let ZFS manage even more space more efficiently

The idea of using ZFS to manage process core dumps begs to be expanded to at least crash dumps. This also enters into the realm of Live Upgrade as it eliminates the need to sync potentially a lot of data on activation of a new BE!

Previously I created a ZFS file system in the root pool, and mounted it on /var/cores.

The same purpose would be even better served with a generic ZFS file system which can be mounted on any currently active Live-Upgrade boot environment. The discussion here suggests the use of a ZFS file system rpool/var_shared, mounted under /var/shared. Directories such as /var/crash and /var/cores can then be moved into this shared file system.

So:

/ $ pfexec ksh -o vi
/ $ zfs create rpool/var_shared
/ $ zfs set mountpoint=/var/shared rpool/var_shared
/ $ mkdir -m 1777 /var/shared/cores
/ $ mkdir /var/shared/crash
/ $ mv /var/crash/`hostname` /var/shared/crash

View my handiwork:

/ $ ls -l /var/shared

total 6
drwxrwxrwt   2 root     root           2 Jun 27 17:11 cores
drwx------   3 root     root           3 Jun 27 17:11 crash
/ $ zfs list -r rpool
NAME                       USED  AVAIL  REFER  MOUNTPOINT
rpool                     13.3G  6.89G    44K  /rpool
rpool/ROOT                10.3G  6.89G    18K  legacy
rpool/ROOT/snv_91         5.95G  6.89G  5.94G  /.alt.tmp.b-b0.mnt/
rpool/ROOT/snv_91@snv_92  5.36M      -  5.94G  -
rpool/ROOT/snv_92         4.33G  6.89G  5.95G  /
rpool/dump                1.50G  6.89G  1.50G  -
rpool/export              6.83M  6.89G    19K  /export
rpool/export/home         6.81M  6.89G  6.81M  /export/home
rpool/swap                1.50G  8.38G  10.3M  -
rpool/export/cores          20K  2.00G    20K  /var/cores
rpool/var_shared            22K  3.00G    22K  /var/shared

Just to review the current settings for saving crash dumps:

/ $ dumpadm

      Dump content: kernel pages
       Dump device: /dev/zvol/dsk/rpool/dump (dedicated)
Savecore directory: /var/crash/solwarg
  Savecore enabled: yes

Set it to use the new path I made above:

/ $ dumpadm -s /var/shared/crash/`hostname`

      Dump content: kernel pages
       Dump device: /dev/zvol/dsk/rpool/dump (dedicated)
Savecore directory: /var/shared/crash/solwarg
  Savecore enabled: yes

Similarly I update the process core dump settings:

/ $ coreadm -g /var/shared/cores/core.%z.%f.%u.%t
/ $ coreadm

     global core file pattern: /var/shared/cores/core.%z.%f.%u.%t
     global core file content: default
       init core file pattern: core
       init core file content: default
            global core dumps: disabled
       per-process core dumps: enabled
      global setid core dumps: enabled
 per-process setid core dumps: disabled
     global core dump logging: enabled

And finally, some cleaning up:

/ $ zfs destroy rpool/export/cores
/ $ cd /var
/var $ rmdir crash
/var $ ln -s shared/crash
/var $ rmdir cores

As previously, the above soft link is just in case there is a naughty script or tool somewhere with a hard coded path to /var/crash/`hostname`. I don't expect to find something like that in officially released Sun software, but I do sometimes use programs not officially released or supported by Sun.

This makes me wonder what else I can make it do! I'm looking forward to my next Live Upgrade to see how well it preserves my configuration before I attempt to move any of the spool directories from /var to /var/shared!
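
For what it is worth, a quick sanity check (the output below is purely illustrative) is to confirm that rpool/var_shared lives outside rpool/ROOT, so lucreate has no reason to snapshot or clone it, and every boot environment will simply mount it at /var/shared:

/ $ zfs list -o name,mountpoint -r rpool | grep var_shared
rpool/var_shared            /var/shared
/ $ df -h /var/shared
Filesystem             size   used  avail capacity  Mounted on
rpool/var_shared       3.0G    22K   3.0G     1%    /var/shared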



Monday, June 30, 2008

Use Live upgrade already

If you are still not using Live Upgrade, you need to make learning it a priority. It will save you hours and make your manager happy, because it costs nothing and gives you a simple, reliable and fast method for backing out your changes. You just need a few (about 10) GB of free disk space, be it in your root ZFS pool, on an unused disk, or even any slice on any disk in the system.

The Live Upgrade concept is simple: make a copy of your "boot environment", run the upgrade or patching against this copy (called the alternate boot environment, or ABE), and finally activate it.

Creation of the new boot environment is done by running a few simple commands which copy and update the files in the new boot environment, an operation that can (and does) take a considerable amount of time, but it runs in the background while the system is up and running, with all services online and active.

The Live Upgrade commands come from three packages that you should install from the target OS's install media - for example, if you want to upgrade from Solaris 9 to Solaris 10, you install SUNWlucfg, SUNWluu and SUNWlur from the Solaris 10 media (or run the liveupgrade20 install script in the Tools/Installers directory).
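
As a rough sketch - assuming the Solaris 10 DVD is mounted under /cdrom/cdrom0 (adjust to match your media), and remembering to remove any older SUNWluu and SUNWlur packages first - the package route looks something like this:

# pkgrm SUNWluu SUNWlur
# cd /cdrom/cdrom0/Solaris_10/Product
# pkgadd -d . SUNWlucfg SUNWlur SUNWluu

The liveupgrade20 script mentioned above does essentially the same thing for you.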

Once the copy is completed, another command (luactivate) is run to confirm that the new boot environment must be activated on the next reboot. On SPARC systems this process modifies the boot-device in the OBP, while on i386 systems it updates GRUB with a new "default" entry.

Then all that is left is the actual reboot. During the reboot some special files and directories will be synchronized one last time - this is because between the time the system was copied over to the clone and the time of the reboot, various things could have changed: people still log in and change their passwords, receive and send mail, spool jobs to the printers, etc. The administrator could even create new login accounts! To deal with this, Live Upgrade synchronizes a pre-determined list of files and directories during the first boot of the new boot environment.

The list of files copied is available here, and can be customized by editing the /etc/lu/synclist file.
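
If memory serves, the entries in /etc/lu/synclist are just a path and an action (OVERWRITE, APPEND or PREPEND), one per line, roughly like this:

/var/mail                   OVERWRITE
/var/spool/cron/crontabs    OVERWRITE
/etc/passwd                 OVERWRITE
/etc/shadow                 OVERWRITE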

Live Upgrade has the intelligence built in to let the new boot environment find the files in the old boot environment during the boot-up process, so this is completely automatic.
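
Putting it all together, a minimal sketch of an upgrade from Solaris 9 to Solaris 10 onto a spare slice might look like the following (the BE names, the slice and the media path are only examples - adjust them to your system):

# lucreate -c sol9 -n sol10 -m /:/dev/dsk/c0t0d0s4:ufs
# luupgrade -u -n sol10 -s /cdrom/cdrom0
# luactivate sol10
# init 6

Note that Live Upgrade wants that last step to be init 6 (or shutdown), not reboot or halt, otherwise the switch-over and the final sync may not happen properly.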

Recent Solaris Express installations prepare for the use of Live Upgrade by automatically setting up a slice and mounting it as "/second_root", but you need to unmount it and remove it from /etc/vfstab before Live Upgrade will allow you to use it. If you don't have a free slice, make one (back up /export, unmount it, and create two smaller slices in its place, one for Live Upgrade and one to restore /export to). This will be cheaper than performing upgrades the traditional way.
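
Freeing up that slice takes all of a minute - something along these lines (the slice is whatever your installation used for /second_root):

# umount /second_root
# vi /etc/vfstab     (delete or comment out the /second_root entry)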

Thursday, June 19, 2008

Using a dedicated ZFS file system to manage process core dumps

ZFS just bristles with potential. Quotas, Reservations, turning compression or atime updates on or off without unmounting. The list goes on.

So now that we have ZFS root (since Nevada build SNV_90, and even earlier when using OpenSolaris or other distributions), let's start to make use of these features.

First things first: on my computer I don't care about access time updates on files or directories, so I disable them.

/ $ pfexec zfs set atime=off rpool

That is not particularly spectacular in itself, but since it is there I use it. The idea is of course to save a few disk updates and the corresponding IOs.
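
A quick check confirms that the setting took, and that it will be inherited by every file system in the pool:

/ $ zfs get atime rpool
NAME   PROPERTY  VALUE  SOURCE
rpool  atime     off    local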

Next: core dumps. One of my pet hates. Many processes dump core in your home directory; these get overwritten or forgotten, and then there are any number of core files lying around all over the file systems, all of them just wasting space since I don't really intend to try to analyze any of them.

Solaris has got a great feature by which core dumps can all be directed to a single directory and, on top of that, given more meaningful file names.

So the idea is to create a directory, say /var/cores and then store the core files in there for later review. But knowing myself these files will just continue to waste space until I one day decide to actually try and troubleshoot a specific issue.

To me this sounds like a perfect job for ZFS.

First I check that there is not already something called /var/cores:

/ $ ls /var/cores
/var/cores: No such file or directory

Great. Now I create it.

/ $ pfexec zfs create rpool/export/cores
/ $ pfexec zfs set mountpoint=/var/cores rpool/export/cores

And set a limit on how much space it can ever consume:

/ $ pfexec zfs set quota=2g rpool/export/cores

Note: This can easily be changed at any time, simply by setting a new quota.
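
For example, if 2 GB ever turns out to be too tight, bumping it is a one-liner (the 4g below is just an arbitrary new value):

/ $ pfexec zfs set quota=4g rpool/export/cores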

Which results in the picture below:

/ $ df -h
Filesystem             size   used  avail capacity  Mounted on
rpool/ROOT/snv_91       20G   5.9G   7.0G    46%    /
/devices                 0K     0K     0K     0%    /devices
/dev                     0K     0K     0K     0%    /dev
ctfs                     0K     0K     0K     0%    /system/contract
proc                     0K     0K     0K     0%    /proc
mnttab                   0K     0K     0K     0%    /etc/mnttab
swap                    2.3G   416K   2.3G     1%    /etc/svc/volatile
objfs                    0K     0K     0K     0%    /system/object
sharefs                  0K     0K     0K     0%    /etc/dfs/sharetab
/usr/lib/libc/libc_hwcap1.so.1
                        13G   5.9G   7.0G    46%    /lib/libc.so.1
fd                       0K     0K     0K     0%    /dev/fd
swap                    2.3G   7.2M   2.3G     1%    /tmp
swap                    2.3G    64K   2.3G     1%    /var/run
rpool/export            20G    19K   7.0G     1%    /export
rpool/export/home       20G   6.8M   7.0G     1%    /export/home
rpool                   20G    44K   7.0G     1%    /rpool
rpool/export/cores     2.0G    18K   2.0G     1%    /var/cores
SHARED                  61G    24K    31G     1%    /shared
... snip ...

And checking the settings on the /var/cores ZFS file system:

/ $ zfs get all rpool/export/cores
NAME                PROPERTY         VALUE                  SOURCE
rpool/export/cores  type             filesystem             -
rpool/export/cores  creation         Thu Jun 19 14:18 2008  -
rpool/export/cores  used             18K                    -
rpool/export/cores  available        2.00G                  -
rpool/export/cores  referenced       18K                    -
rpool/export/cores  compressratio    1.00x                  -
rpool/export/cores  mounted          yes                    -
rpool/export/cores  quota            2G                     local
rpool/export/cores  reservation      none                   default
rpool/export/cores  recordsize       128K                   default
rpool/export/cores  mountpoint       /var/cores             local
rpool/export/cores  sharenfs         off                    default
rpool/export/cores  checksum         on                     default
rpool/export/cores  compression      off                    default
rpool/export/cores  atime            off                    inherited from rpool
rpool/export/cores  devices          on                     default
rpool/export/cores  exec             on                     default
rpool/export/cores  setuid           on                     default
rpool/export/cores  readonly         off                    default
rpool/export/cores  zoned            off                    default
rpool/export/cores  snapdir          hidden                 default
rpool/export/cores  aclmode          groupmask              default
rpool/export/cores  aclinherit       restricted             default
rpool/export/cores  canmount         on                     default
rpool/export/cores  shareiscsi       off                    default
rpool/export/cores  xattr            on                     default
rpool/export/cores  copies           1                      default
rpool/export/cores  version          3                      -
rpool/export/cores  utf8only         off                    -
rpool/export/cores  normalization    none                   -
rpool/export/cores  casesensitivity  sensitive              -
rpool/export/cores  vscan            off                    default
rpool/export/cores  nbmand           off                    default
rpool/export/cores  sharesmb         off                    default
rpool/export/cores  refquota         none                   default
rpool/export/cores  refreservation   none                   default

Note that access-time updates on this file system are off - the setting has been inherited from the pool. The only "local" settings are the mountpoint and the quota, which correspond to the items I specified manually.

Now just to make new core files actually use this directory. At present, the default settings from coreadm look like this:

/ $ coreadm
     global core file pattern:
     global core file content: default
       init core file pattern: core
       init core file content: default
            global core dumps: disabled
       per-process core dumps: enabled
      global setid core dumps: disabled
 per-process setid core dumps: disabled
     global core dump logging: disabled

Looking at the coreadm man page, there is a fair amount of flexibility in what can be done. I want core files to have a name identifying the zone in which the process was running, the process executable file, and the user. I also don't want core dumps to overwrite each other when the same process keeps on faulting, so I will add a time stamp to the core file name.

/ $ pfexec coreadm -g /var/cores/core.%z.%f.%u.%t

And then I would like to enable logging of an event any time when a core file is generated, and also to store core files for Set-UID processes:

/ $ pfexec coreadm -e global-setid -e log

And finally, just to review the core-dump settings, these now look like this:

/ $ coreadm
     global core file pattern: /var/cores/core.%z.%f.%u.%t
     global core file content: default
       init core file pattern: core
       init core file content: default
            global core dumps: disabled
       per-process core dumps: enabled
      global setid core dumps: enabled
 per-process setid core dumps: disabled
     global core dump logging: enabled

Now if that is not useful, I don't know what is! You will soon start to appreciate just how much space is wasted and just how truly rigid and inflexible other file systems are once you run your machine with a ZFS root!




Saturday, June 14, 2008

Update: More on how to make x86 Solaris with Grub boot verbosely

Since I posted a while ago on how to make Solaris boot verbosely, I have found a better way. Or rather, I have learned a bit more about this.

Instead of just adding -v to the kernel line, add "-v -m verbose".

The "-m verbose" portion passes the verbose option to SMF, giving you verbose information about startup of services.

The "-v" causes the messages which normally goes to the system log to also be emitted on the console.

My grub entry for verbose booting now looks like this:

# Solaris SNV91 Verbose Boot
title Solaris SNV_91 Verbose Boot
findroot (BE_SNV_91,1,a)
kernel$ /platform/i86pc/kernel/$ISADIR/unix -v -m verbose -B $ZFS-BOOTFS
module$ /platform/i86pc/$ISADIR/boot_archive
# End Solaris SNV_91 Verbose


Of course you don't need the entry to be in grub - if your system is not booting, use the edit feature to add these options to the kernel line in the grub item you want to boot from.
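
From the GRUB menu, that goes roughly like this (from memory, so the exact keystrokes may differ slightly on your build; the change only applies to that one boot):

  1. Highlight the entry you want to boot and press "e"
  2. Select the kernel$ line and press "e" again
  3. Append " -v -m verbose" to the end of the line and press Enter
  4. Press "b" to boot the modified entry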

Also just a note on the splash image: removing it is entirely optional - not removing it will not hide any bootup messages (as I previously thought).

Not a day goes by that I don't learn something new about Solaris.