Tuesday, July 29, 2008

How Solaris disk device names work

Writing this turned out to be surprisingly difficult, as the article kept on growing too long. I tried to be complete but also concise, to make this a useful introduction to how Solaris uses disks and how device naming works.

So to start at the start: Solaris builds a device tree which is persistent across reboots and even across configuration changes. Once a device is found at a specific position on the system bus, an entry is created for the device instance in the device tree, and an instance number is allocated. Predictably, the first instance of a device is zero (e.g. e1000g0), and subsequent instances of devices using the same driver get allocated instance numbers incrementally (e1000g1, e1000g2, etc). The allocated instance number is registered, together with the device driver and the path to the physical device, in the /etc/path_to_inst file.
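As an illustration, each line in /etc/path_to_inst maps a physical device path to an instance number and a driver name. The paths below are made up for the example, but the format is what you will see on a real system:

$ grep e1000g /etc/path_to_inst
"/pci@0,0/pci10de,377@a/pci8086,125e@0" 0 "e1000g"
"/pci@0,0/pci10de,377@a/pci8086,125e@0,1" 1 "e1000g"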

This specific feature of Solaris is very important in providing stable, predictable behavior across reboots and hardware changes. For disk controllers this is critical as system bootability depends on it!

With Linux, the first disk in the system is known as /dev/sda, even if it happens to be on the second controller, or to have a target number other than zero on that controller. A new disk added on the first controller, or on the same controller but with a lower target number, causes the existing disk to move to /dev/sdb, with the new disk becoming /dev/sda. This used to break systems, causing them to become non-bootable, and was a general headache. Some methods of dealing with this exist, using unique disk identifiers, device paths based on /dev/disk/by-path, etc.
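For example, the stable Linux names under /dev/disk/by-path are just symbolic links back to whichever sdX name the disk happens to have at the moment (illustrative output; the details vary by distribution and kernel version):

$ ls -l /dev/disk/by-path/
lrwxrwxrwx 1 root root  9 Jul 29 10:15 pci-0000:00:1f.2-scsi-0:0:0:0 -> ../../sda
lrwxrwxrwx 1 root root 10 Jul 29 10:15 pci-0000:00:1f.2-scsi-0:0:0:0-part1 -> ../../sda1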

If a Solaris system is configured initially with all disks attached to the second controller, the devices will get names starting with c1. Disks added to the first controller later on will get names starting with c0, and the existing disk device names will remain unaffected. If a new controller is added to the system, it will get a new instance number, e.g. c2, and existing disk device names will again remain unaffected.

Solaris composes disk device names (device aliases) of parts which identify the controller, the target-id, the LUN-id, and finally the slice or partition on the disk.

I will use some examples to explain this. Looking at these devices:

$ ls -lL /dev/dsk/c1t*

br--------   1 root     sys       27, 16 Jun  2 16:26 /dev/dsk/c1t0d0p0
br--------   1 root     sys       27, 17 Jun  2 16:26 /dev/dsk/c1t0d0p1
br--------   1 root     sys       27, 18 Jun  2 16:26 /dev/dsk/c1t0d0p2
br--------   1 root     sys       27, 19 Jun  2 16:26 /dev/dsk/c1t0d0p3
br--------   1 root     sys       27, 20 Jun  2 16:26 /dev/dsk/c1t0d0p4
br--------   1 root     sys       27,  0 Jun  2 16:26 /dev/dsk/c1t0d0s0
br--------   1 root     sys       27,  1 Jun  2 16:26 /dev/dsk/c1t0d0s1
br--------   1 root     sys       27, 10 Jun  2 16:26 /dev/dsk/c1t0d0s10
br--------   1 root     sys       27, 11 Jun  2 16:26 /dev/dsk/c1t0d0s11
br--------   1 root     sys       27, 12 Jun  2 16:26 /dev/dsk/c1t0d0s12
br--------   1 root     sys       27, 13 Jun  2 16:26 /dev/dsk/c1t0d0s13
br--------   1 root     sys       27, 14 Jun  2 16:26 /dev/dsk/c1t0d0s14
br--------   1 root     sys       27, 15 Jun  2 16:26 /dev/dsk/c1t0d0s15
br--------   1 root     sys       27,  2 Jun  2 16:26 /dev/dsk/c1t0d0s2
br--------   1 root     sys       27,  3 Jun  2 16:26 /dev/dsk/c1t0d0s3
br--------   1 root     sys       27,  4 Jun  2 16:26 /dev/dsk/c1t0d0s4
br--------   1 root     sys       27,  5 Jun  2 16:26 /dev/dsk/c1t0d0s5
br--------   1 root     sys       27,  6 Jun  2 16:26 /dev/dsk/c1t0d0s6
br--------   1 root     sys       27,  7 Jun  2 16:26 /dev/dsk/c1t0d0s7
br--------   1 root     sys       27,  8 Jun  2 16:26 /dev/dsk/c1t0d0s8
br--------   1 root     sys       27,  9 Jun  2 16:26 /dev/dsk/c1t0d0s9

We notice the following:

1. The entries exist as links under /dev/dsk, pointing to the device node files in the /devices tree. Actually every device has a second entry under /dev/rdsk. The ones under /dev/dsk are "block" devices, used in a random-access manner, e.g. for mounting file systems. The "raw" device links under /dev/rdsk are character devices, used for low-level access functions (such as creating a new file system).

2. The device names all start with c1, indicating controller c1 - so basically all the entries above are on one controller.

3. The next part of the device name is the target-id, indicated by t0. This is determined by the SCSI target-id number set on the device, and not by the order in which disks are discovered. Any new disk added to this controller will have a new unique SCSI target number and so will not affect existing device names.

4. After the target number each disk has a LUN-id number, d0 in the example. This too is determined by the SCSI LUN-id provided by the device. Normal disks on a simple SCSI card all show up as LUN-id 0, but devices like arrays or JBODs can present multiple LUNs per target. (In such devices the target usually indicates the port number on the enclosure.)

5. Finally each device name identifies a partition or slice on the disk. Names ending with a p# indicate a PC BIOS disk partition (sometimes called an fdisk or primary partition), and names ending with an s# indicate a Solaris slice.

This calls for some more explanation. There are five device names ending with p0 through p4. The p0 device, e.g. c1t0d0p0, indicates the whole disk as seen by the BIOS. The c_t_d_p1 device is the first primary partition, with c_t_d_p2 being the second, etc. These devices represent all four of the allowable primary partitions, and always exist even when the partitions are not in use.
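On x86 systems the fdisk utility operates on this whole-disk p0 device. If I recall correctly, the -W - option dumps the fdisk partition table to standard output, which is a quick way to see the primary partitions (device name as in the example above):

# fdisk -W - /dev/rdsk/c1t0d0p0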

In addition there are 16 devices with names ending with s0 through s15. These are Solaris "disk slices", and originate from the way disks are "partitioned" on SPARC systems. Essentially Solaris uses slices much like PCs use partitions - most Solaris disk administration utilities work with disk slices, not with fdisk or BIOS partitions.

The way the "disk" is sliced is stored in the Solaris VTOC, which resides in the first sector of the "disk". In the case of x86 systems, the VTOC exists inside one of the primary partitions, and in fact most disk utilities treats the Solaris partition as the actual disk. Solaris splits up the particular partition into "slices", thus the afore mentioned "disk slices" really refers to slices existing in a partition.

Note that Solaris disk slices are often called disk partitions, so the two can easily be confused - when documentation refers to partitions you need to make sure you understand whether PC BIOS partitions or Solaris slices are implied. In general, if the documentation applies to SPARC hardware (as well as to x86 hardware), then the partitions in question are Solaris slices (SPARC does not have an equivalent to the PC BIOS partition concept).

Example Disk Layout:

p1      First primary partition
p2      Second primary partition
p3      Solaris partition (type 0xBF or 0x80), containing the slices:
          s0    Slice commonly used for root
          s1    Slice commonly used for swap
          s2    Whole disk (backup or overlap slice)
          s3    Custom use slice
          s4    Custom use slice
          s5    Custom use slice
          s6    Custom use slice, commonly /export
          s7    Custom use slice
          s8    Boot block
          s9    Alternates (2 cylinders)
          s10   x86 custom use slice
          s11   x86 custom use slice
          s12   x86 custom use slice
          s13   x86 custom use slice
          s14   x86 custom use slice
          s15   x86 custom use slice
p4      Extended partition, containing for example:
          p5    Example: Linux or data partition
          p6    Example: Linux or data partition
          etc.  Example: Linux or data partition

Note that traditionally slice 2 "overlaps" the whole disk, and is commonly referred to as the backup slice, or slightly less commonly, called the overlap slice.

The ability to have slice numbers from 8 to 15 is x86 specific. By default slice 8 covers the area on the disk where the label, VTOC and boot record are stored. Slice 9 covers the area where the "alternates" data is stored - a two-cylinder area used to record information about relocated/errored sectors.
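To illustrate, this is roughly what prtvtoc (covered further down) reports for a hypothetical x86 boot disk - the numbers are made up, but the shape of the output and the tags for the boot (slice 8), alternates (slice 9) and backup (slice 2) slices are typical:

$ prtvtoc /dev/rdsk/c1t0d0s2
* /dev/rdsk/c1t0d0s2 partition map
*
* Dimensions:
*     512 bytes/sector
*   16065 sectors/cylinder
*    4462 accessible cylinders
*
*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
       0      2    00      48195  16788330  16836524   /
       1      3    01   16836525   4209030  21045554
       2      5    01          0  71682030  71682029
       8      1    01          0     16065     16064
       9      9    00      16065     32130     48194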

Another example of disk device entries:

$ ls -lL /dev/dsk/c0*

brw-r-----   1 root     sys      102, 16 Jul 14 19:45 /dev/dsk/c0d0p0
brw-r-----   1 root     sys      102, 17 Jul 14 19:45 /dev/dsk/c0d0p1
brw-r-----   1 root     sys      102, 18 Jul 14 19:45 /dev/dsk/c0d0p2
brw-r-----   1 root     sys      102, 19 Jul 14 19:12 /dev/dsk/c0d0p3
brw-r-----   1 root     sys      102, 20 Jul 14 19:45 /dev/dsk/c0d0p4
brw-r-----   1 root     sys      102,  0 Jul 14 19:45 /dev/dsk/c0d0s0
brw-r-----   1 root     sys      102,  1 Jul 14 19:45 /dev/dsk/c0d0s1
...
brw-r-----   1 root     sys      102,  8 Jul 14 19:45 /dev/dsk/c0d0s8
brw-r-----   1 root     sys      102,  9 Jul 14 19:45 /dev/dsk/c0d0s9

The above example is taken from an x86 system. Note the lack of a target number in the device names. This is particular to ATA hard drives on x86 systems. Apart from that, the naming works just like the device names described above.

Below, comparing the block and raw device entries:

$ ls -l /dev/*dsk/c1t0d0p0

lrwxrwxrwx   1 root     root          49 Jun 26 16:22 /dev/dsk/c1t0d0p0 -> ../../devices/pci@0,0/pci-ide@1f,2/ide@1/sd@0,0:q
lrwxrwxrwx   1 root     root          53 Jun  2 16:18 /dev/rdsk/c1t0d0p0 -> ../../devices/pci@0,0/pci-ide@1f,2/ide@1/sd@0,0:q,raw

These look the same, except that the second one points to the raw device node.
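The two come into play for different tasks: creating a file system is done against the raw device, while mounting uses the block device. A quick sketch (the slice and mount point here are arbitrary):

# newfs /dev/rdsk/c1t0d0s6
# mount /dev/dsk/c1t0d0s6 /export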

For completeness' sake, some utilities used in managing disks:

format - The work-horse, used to perform partitioning (including fdisk partitioning on x86 based systems), analyzing/testing the disk media for defects, tuning advanced SCSI parameters, and generally checking the status and health of disks.
rmformat - Shows information about removable devices, formats media, etc.
prtvtoc - Command-line utility to display information about disk geometry and, more importantly, the contents of the VTOC in a human-readable format, showing the layout of the Solaris slices on the disk.
fmthard - Writes or overwrites a VTOC on a disk. Its input format is compatible with the output produced by prtvtoc, so it is possible to copy the VTOC between two disks by means of a command like this:

prtvtoc /dev/rdsk/c1t0d0s2 | fmthard -s - /dev/rdsk/c1t1d0s2

This is obviously not meaningful if the second disk does not have enough space. If the disks are of different sizes, you can use something like this:

prtvtoc /dev/rdsk/c1t0d0s2 | awk '$1 != 2' | fmthard -s - /dev/rdsk/c1t1d0s2

The above awk command will cause the entry for slice 2 to be omitted, and fmthard will then maintain the existing slice 2 entry on the target disk or, if none exists, create a default one.

Also note, as implied above, that Solaris slices can (and often do) overlap. Care needs to be taken not to have file systems on slices which overlap other slices.

iostat -En - Shows "error" information about disks and, often very useful, the firmware revisions and manufacturer's identifier strings.
format -e - Runs format in expert mode, revealing enhanced/advanced functionality, such as the cache option on SCSI disks.
format -Mm - Enables debugging output; in particular, it makes SCSI probe failures non-silent.

cfgadm and luxadm also deserve honorable mention here. These commands manage disk enclosures, detaching and attaching devices, etc., but are also used in managing some aspects of disks.

luxadm -e port - Shows a list of FC HBAs.

luxadm can also be used, for example, to set the beacon LED on individual disks in FCAL enclosures that support this function. The details are somewhat specific to the relevant enclosure.

cfgadm can be used to probe SAN connected subsystems, e.g. by doing:

cfgadm -c configure c2::XXXXXXXXXXXX

(where XXXXXXXXXXX is the enclosure port WWN, using controller c2)
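To see the attachment points and their state (before or after configuring), cfgadm -al lists them; the WWN below is made up, but the columns are what you can expect:

# cfgadm -al
Ap_Id                          Type         Receptacle   Occupant     Condition
c1                             scsi-bus     connected    configured   unknown
c1::dsk/c1t0d0                 CD-ROM       connected    configured   unknown
c2                             fc-fabric    connected    configured   unknown
c2::21000004cf7a41d9           disk         connected    configured   unknown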

Hopefully this gives you an idea about how disk device names, controller names, and partitions and slices all relate to one another.

Wednesday, July 16, 2008

Reading and Writing ISO images using Solaris

After my recent post on mounting ISO image files I thought I should write a quick article on the other ways of using these files: reading a disk into a file, and burning a file to a disk. This is not a complete guide on the topic by a long shot, but if you just want the quick-start answer, it is here.

If you have an iso9660 CD (or DVD) image file that you want to burn to a disk, you simply use this command:

# cdrw -i filename.iso

This will write the file named filename.iso to the default CD writer device. For DVD media the session is closed using Disk-At-Once writing, while for CD media Track-At-Once writing is used.
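If you are not sure which device cdrw will pick, it can list the writers it finds. The output below is from memory and the exact layout may differ, but the command is simply:

# cdrw -l
Looking for CD devices...
    Node                   Connected Device               Device type
 ----------------------+--------------------------------+----------------
 /dev/rdsk/c1t0d0s2    | MATSHITA DVD-RAM UJ-841S  1.40 | CD Reader/Writer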

To create an ISO image from a disk, use this command:

# readcd dev=/dev/rdsk/c1t0d0s2 f=filename.iso speed=1 retries=20

readcd needs at least the device and the file to be specified. To discover the device, you can use the command "iostat -En" and look for the Writer device, or you can let readcd scan for a device, using a command like this:

# readcd -scanbus

scsibus1:
 1,0,0 100) 'MATSHITA' 'DVD-RAM UJ-841S ' '1.40' Removable CD-ROM
 1,1,0 101) *
 1,2,0 102) *
 1,3,0 103) *
 1,4,0 104) *
 1,5,0 105) *
 1,6,0 106) *
 1,7,0 107) *

The device 1,0,0 can be used directly, or you can convert it to the Solaris naming convention as I did in the example above.
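For reference, the same writer shows up in iostat -En output something like this (trimmed to the one device; details recalled from memory):

# iostat -En
c1t0d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: MATSHITA Product: DVD-RAM UJ-841S  Revision: 1.40 Serial No:
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0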

There are of course other ways of doing it; feel free to comment and tell me about your favourite method for reading to or burning from ISO image files.

Wednesday, July 9, 2008

A short guide to the Solaris Loop-back file systems and mounting ISO images

The Solaris Loop-back file system is a handy bit of software, allowing you to "mount" directories, files and, in particular, CD or DVD image files in ISO-9660 format.

To make this more user friendly, build 91 of Nevada (ONNV) introduces the ability for the mount command to automatically create the loop-back devices for ISO images! The changelog for NV 91 has the following note:

Issues Resolved: PSARC case 2008/290 : lofi mount BUG/RFE: 6384817 Need persistent lofi based mounts and direct mount(1m) support for lofi

In older releases, it was necessary to run two commands to mount an ISO image file. The first to set up a virtual device for the ISO image:

# lofiadm -a /shared/Downloads/image.iso
/dev/lofi/1

And then to mount it somewhere:

# mount -F hsfs -o ro /dev/lofi/1 /mnt

Solaris uses hsfs to indicate the "High Sierra File System" driver used to mount ISO-9660 file systems. Specify "-o ro" to make it read-only, though that is the default for hsfs file systems, at least lately (I seem to recall that at one point in the past it was mandatory to specify read-only mounting explicitly).

Looking at what has been happening here, we can see the Loop-back device by running lofiadm without any options:

# lofiadm

Block Device             File                           Options
/dev/lofi/1              /shared/Downloads/image.iso -

And the mounted file system:

# df -k /mnt

Filesystem            kbytes    used   avail capacity  Mounted on
/dev/lofi/1          2915052 2915052       0   100%    /mnt

The new feature of the mount command requires a full path to the ISO file (just like lofiadm does, at any rate for now):

# mount -F hsfs -o ro /shared/Downloads/image2.iso /mnt

To check the status:

# df -k /mnt

Filesystem            kbytes    used   avail capacity  Mounted on
/shared/Downloads/image2.iso
                     7781882 7781882       0   100%    /mnt

And when we run lofiadm we see it automatically created a new device, /dev/lofi/2:

# lofiadm

Block Device             File                           Options
/dev/lofi/1              /shared/Downloads/image.iso -
/dev/lofi/2              /shared/Downloads/image2.iso -
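When you are done with an image, unmount it and remove its lofi device again (using the second device from the example above):

# umount /mnt
# lofiadm -d /dev/lofi/2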

Some of the other uses of the Loop-back file system:

You can mount any directory on any other directory:

# mkdir /mnt1
# mount -F lofs -o ro /usr/spool/print /mnt1

Note the use of lofs as the file system "type". This is a bit like a hard-link to a directory, and it can exist across file systems. These mounts can be read-write or read-only.
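Such a lofs mount can also be made persistent across reboots with a line in /etc/vfstab; a sketch, with made-up paths:

#device         device          mount           FS      fsck    mount   mount
#to mount       to fsck         point           type    pass    at boot options
/export/tools   -               /opt/tools      lofs    -       yes     -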

You can also mount any individual file onto another file:

# mkdir /tmp/mnt
# echo foobar > /tmp/mnt/X
# mount -F lofs /usr/bin/ls /tmp/mnt/X
# ls -l /tmp/mnt

total 67
-r-xr-xr-x   1 root     bin        33396 Jun 16 05:43 X
# cd /tmp/mnt
# ./X
X
# ./X -l
total 67
-r-xr-xr-x   1 root     bin        33396 Jun 16 05:43 X

The above feature incidentally inspired item number 10 on my ZFS feature wish list.

This allows for a lot of flexibility. Indeed, this functionality is central to how file systems and disk space are provisioned in Solaris Zones. If you play around with it you will find plenty of uses for it!
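As an example of the zones connection, this is (roughly) how a global-zone directory gets loop-back mounted into a non-global zone with zonecfg - the zone name and paths here are made up:

# zonecfg -z myzone
zonecfg:myzone> add fs
zonecfg:myzone:fs> set dir=/opt/tools
zonecfg:myzone:fs> set special=/opt/tools
zonecfg:myzone:fs> set type=lofs
zonecfg:myzone:fs> end
zonecfg:myzone> exit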





Monday, July 7, 2008

Some days I'm glad that I'm not a network administrator

Sunday, July 6, 2008

ZFS missing features

What would truly make ZFS be The Last Word in File Systems (PDF)?

Why every feature of course! Here is my wishlist!

  1. Nested vdevs (e.g. RAID 1+Z)
  2. Hierarchical Storage Management (migrate rarely used files to cheaper/slower vdevs)
  3. Traditional Unix quotas (i.e. for when you have multiple users owning files in the same directories spread out across a file system)
  4. A way to convert a directory on a ZFS file system into a new ZFS file system, and the corresponding reverse function to merge a directory back into its parent (because the admin made some wrong decision)
  5. A backup function supporting partial restores. In fact partial backups should be possible too, e.g. backing up any directory or file list, not necessarily only at the file system level. And restores should not require the file system to be unmounted / re-mounted.
  6. Re-layout of pools (to accommodate adding disks to a raidz, converting a non-redundant pool to raidz, removing disks from a pool, etc) (Yes, I'm aware of some work in this regard)
  7. Built-in multi-pathing capabilities (with automatic/intelligent detection of active paths to devices), e.g. integrated MPxIO functionality. I'm guessing this is not there yet because people may want to use MPxIO for other devices not under ZFS control, and this would create situations where there are redundant layers of multipathing logic.
  8. True global file system functionality (multiple hosts accessing the same LUNs and mounting the same file systems with parallel write). Or even just a sharezfs option (like sharenfs, but allowing the client to access ZFS features, e.g. to set ZFS properties, create datasets, snapshots, etc, similar in functionality to what is possible when granting a zone ownership of a ZFS dataset).
  9. While we're at it: in-place conversion from, e.g., UFS to ZFS.
  10. The ability to snapshot a single file in a ZFS file system (so that you can effect per-file version tracking)
  11. An option on the zpool create command to take a list of disks and automatically set up a layout, intelligently taking into consideration the number of disks and the number of controllers, allowing the user to select from a set of profiles determining optimization for performance, space or redundancy.

So... what would it take to see ZFS as the new default file system on, for example USB thumb drives, memory cards for digital cameras and cell phones, etc? In fact, can't we use ZFS for RAM management too (snapshot system memory)?




Saturday, July 5, 2008

The Pupil will surpass the tutor

Linux is an attempt at making a free clone of Unix. Initially it aimed to be Unix compatible, though I feel that goal has become less and less important as Linux grew in maturity.

Now all of a sudden we have a complete turn-about as the big Unices want to be like Linux! Linux is attractive for a variety of reasons, including a fast, well refined kernel, lots of readily available and free applications, good support and, because of these, a growing and loyal following. The utilities available with most Linux distributions are based on the core utilities found in the big Unices, plus a large collection of new additions, all working together in a more or less coherent way to build a usable platform.

Nowadays many new Unix administrators have at least some Linux experience, and with this background can easily be frustrated when looking for Linux-specific utilities (where is top in Solaris?). End users would like to see the applications they used on Linux run on Unix. And the ability to run Linux on cheap and cheerful PC hardware does not detract from Linux's popularity by any means.

So Sun Microsystems, just like IBM with AIX, finds itself looking at Linux to see what this platform is doing right to make it successful. To me this, more than anything else, is proof that Linux has finally grown up.

I expect that a leap-frog game will emerge between Linux and Unix, particularly Solaris, with the two competing on innovative features to be the platform of choice for both datacenter and desktop applications.

Congratulations to the Linux community on a job well done.





Tuesday, July 1, 2008

Let ZFS manage even more space more efficiently

The idea of using ZFS to manage process core dumps begs to be expanded to at least crash dumps as well. This also enters the realm of Live Upgrade, as it eliminates the need to sync potentially large amounts of data on activation of a new BE!

Previously I created a ZFS file system in the root pool, and mounted it on /var/cores.

The same purpose would be even better served with a generic ZFS file system which can be mounted on any currently active Live-Upgrade boot environment. The discussion here suggests the use of a ZFS file system rpool/var_shared, mounted under /var/shared. Directories such as /var/crash and /var/cores can then be moved into this shared file system.

So:

/ $ pfexec ksh -o vi
/ $ zfs create rpool/var_shared
/ $ zfs set mountpoint=/var/shared rpool/var_shared
/ $ mkdir -m 1777 /var/shared/cores
/ $ mkdir /var/shared/crash
/ $ mv /var/crash/`hostname` /var/shared/crash
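The listing below shows only 3 GB available on the new file system; that presumably comes from a quota on the dataset (not shown in the steps above), set with something along these lines:

/ $ zfs set quota=3g rpool/var_shared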

View my handiwork:

/ $ ls -l /var/shared

total 6
drwxrwxrwt   2 root     root           2 Jun 27 17:11 cores
drwx------   3 root     root           3 Jun 27 17:11 crash
/ $ zfs list -r rpool
NAME                       USED  AVAIL  REFER  MOUNTPOINT
rpool                     13.3G  6.89G    44K  /rpool
rpool/ROOT                10.3G  6.89G    18K  legacy
rpool/ROOT/snv_91         5.95G  6.89G  5.94G  /.alt.tmp.b-b0.mnt/
rpool/ROOT/snv_91@snv_92  5.36M      -  5.94G  -
rpool/ROOT/snv_92         4.33G  6.89G  5.95G  /
rpool/dump                1.50G  6.89G  1.50G  -
rpool/export              6.83M  6.89G    19K  /export
rpool/export/home         6.81M  6.89G  6.81M  /export/home
rpool/swap                1.50G  8.38G  10.3M  -
rpool/export/cores          20K  2.00G    20K  /var/cores
rpool/var_shared            22K  3.00G    22K  /var/shared

Just to review the current settings for saving crash dumps:

/ $ dumpadm

      Dump content: kernel pages
       Dump device: /dev/zvol/dsk/rpool/dump (dedicated)
Savecore directory: /var/crash/solwarg
  Savecore enabled: yes

Set it to use the new path I made above:

/ $ dumpadm -s /var/shared/crash/`hostname`

      Dump content: kernel pages
       Dump device: /dev/zvol/dsk/rpool/dump (dedicated)
Savecore directory: /var/shared/crash/solwarg
  Savecore enabled: yes

Similarly I update the process core dump settings:

/ $ coreadm -g /var/shared/cores/core.%z.%f.%u.%t
/ $ coreadm

     global core file pattern: /var/shared/cores/core.%z.%f.%u.%t
     global core file content: default
       init core file pattern: core
       init core file content: default
            global core dumps: disabled
       per-process core dumps: enabled
      global setid core dumps: enabled
 per-process setid core dumps: disabled
     global core dump logging: enabled

And finally, some cleaning up:

/ $ zfs destroy rpool/export/cores
/ $ cd /var
/var $ rmdir crash
/var $ ln -s shared/crash
/var $ rmdir cores

As previously, the above soft link is just in case there is a naughty script or tool somewhere with a hard-coded path to /var/crash/`hostname`. I don't expect to find something like that in officially released Sun software, but I do sometimes use programs not officially released or supported by Sun.

This makes me wonder what else I can make it do! I'm looking forward to my next Live Upgrade to see how well it preserves my configuration before I attempt to move any of the spool directories from /var to /var/shared!