Tuesday, July 29, 2008

How Solaris disk device names work

Writing this turned out to be surprisingly difficult as the article kept on growing too long. I tried to be complete but also concise to make this a useful introduction to how Solaris uses disks and on how device naming works.

So to start at the start: Solaris builds a device tree which is persistent across reboots and even across configuration changes. Once a device is found at a specific position on the system bus, and entry is created for the device instance in the device tree, and an instance number is allocated. Predictably, the first instance of a device is zero, (e.g e1000g0) and subsequent instances of device using the same driver gets allocated instance numbers incrementally (e1000g1, e1000g2, etc). The allocated instance number is registered together with the device driver and path to the physical device in the /etc/path_to_inst file.

This specific feature of Solaris is very important in providing stable, predictable behavior across reboots and hardware changes. For disk controllers this is critical as system bootability depends on it!

With Linux, if the first disk in the system is known as /dev/sda, even if it happens to be on the second controller, or have a target number other than zero on that controller. New disk added on the first controller, or on the same controller but with a lower target number, causes the existing disk to move to /dev/sdb, and the new disk then becomes /dev/sda. This used to break systems, causing them to become non-bootable, and was being a general headache. Some methods of dealing with this exists, using unique disk identifiers and device paths based on /dev/disk/by-path, etc.

If a Solaris system is configured initially with all disks attached to the second controlled, the devices will get names starting with c1. Disks added to the first controller later on will have names starting with c0, and the existing disk device names will remain unaffected. If a new controlled is added to the system, it will get a new instance number, e.g c2, and existing disk device names will remain unaffected.

Solaris however composes disk device names (device aliases) of parts which identifies the controller, the target-id, the LUN-id, and finally the slice or partition on the disk.

I will use some examples to explain this. Looking at this device:

$ ls -lL /dev/dsk/c1t*

br--------   1 root     sys       27, 16 Jun  2 16:26 /dev/dsk/c1t0d0p0
br--------   1 root     sys       27, 17 Jun  2 16:26 /dev/dsk/c1t0d0p1
br--------   1 root     sys       27, 18 Jun  2 16:26 /dev/dsk/c1t0d0p2
br--------   1 root     sys       27, 19 Jun  2 16:26 /dev/dsk/c1t0d0p3
br--------   1 root     sys       27, 20 Jun  2 16:26 /dev/dsk/c1t0d0p4
br--------   1 root     sys       27,  0 Jun  2 16:26 /dev/dsk/c1t0d0s0
br--------   1 root     sys       27,  1 Jun  2 16:26 /dev/dsk/c1t0d0s1
br--------   1 root     sys       27, 10 Jun  2 16:26 /dev/dsk/c1t0d0s10
br--------   1 root     sys       27, 11 Jun  2 16:26 /dev/dsk/c1t0d0s11
br--------   1 root     sys       27, 12 Jun  2 16:26 /dev/dsk/c1t0d0s12
br--------   1 root     sys       27, 13 Jun  2 16:26 /dev/dsk/c1t0d0s13
br--------   1 root     sys       27, 14 Jun  2 16:26 /dev/dsk/c1t0d0s14
br--------   1 root     sys       27, 15 Jun  2 16:26 /dev/dsk/c1t0d0s15
br--------   1 root     sys       27,  2 Jun  2 16:26 /dev/dsk/c1t0d0s2
br--------   1 root     sys       27,  3 Jun  2 16:26 /dev/dsk/c1t0d0s3
br--------   1 root     sys       27,  4 Jun  2 16:26 /dev/dsk/c1t0d0s4
br--------   1 root     sys       27,  5 Jun  2 16:26 /dev/dsk/c1t0d0s5
br--------   1 root     sys       27,  6 Jun  2 16:26 /dev/dsk/c1t0d0s6
br--------   1 root     sys       27,  7 Jun  2 16:26 /dev/dsk/c1t0d0s7
br--------   1 root     sys       27,  8 Jun  2 16:26 /dev/dsk/c1t0d0s8
br--------   1 root     sys       27,  9 Jun  2 16:26 /dev/dsk/c1t0d0s9

We notice the following:

1. The entries exist as links under /dev/dsk, pointing to the device node files in the /devices tree. Actually every device has got a second instance under /dev/rdsk. The ones under /dev/dsk are "block" devices, used in a random-access manner, e.g for mounting file systems. The "raw" device links are character devices, used for low-level access functions (such as creating a new file system).

2. The device names all start with c1, indicating controller c1 - so basically all the entries above are on one controller.

3. The next part of the device name is the target-id, indicated by t0. This is determined by the SCSI target-id number set on the device, and not by the order in which disks are discovered. Any new disk added to this controller will have a new unique SCSI target number and so will not affect existing device names.

4. After the target number each disk has got a LUN-id number, in the example d0. This too is determined by the SCSI LUN-id provided by the device. Normal disks on a simple SCSI card all show up as LUN-id 0, but devices like arrays or jbods can present multiple LUNs on a target. (In such devices the target usually indicates the port number on the enclosure)

5. Finally each device identifies a partition or slice on the disk. Devices with names ending with a p# indicates a PC BIOS disk partition (sometimes called an fdisk or primary partition), and names ending with an s# indicates a Solaris slice.

This begs some more explaining. There are five device names ending with p0 through p4. The p0 device, eg c1t0d0p0, indicates the whole disk as seen by the BIOS. The c_t_d_p1 device is the first primary partition, with c_t_d_p2 being the second, etc. These devices represent all four of the allowable primary partitions, and always exists even when the partitions are not used.

In addition there are 16 devices with names ending with s0 though s15. These are Solaris "disk slices", and originate from the way disks are "partitioned" on SPARC systems. Essentially Solaris uses slices much like PCs use partitions - most Solaris disk admin utilities work with disk slices, not with fdisk or BIOS partitions.

The way the "disk" is sliced is stored in the Solaris VTOC, which resides in the first sector of the "disk". In the case of x86 systems, the VTOC exists inside one of the primary partitions, and in fact most disk utilities treats the Solaris partition as the actual disk. Solaris splits up the particular partition into "slices", thus the afore mentioned "disk slices" really refers to slices existing in a partition.

Note that Solaris disk slices are often called disk partitions, so the two can be easily confused - when documentation refers to partitions you need to make sure you understand whether PC BIOS partitions or Solaris Slices are implied. In generally if the documentation applies to SPARC hardware (as well as to x86 hardware), then partitions are Solaris slices (SPARC does not have an equivalent to the PC BIOS partition concept)

Example Disk Layout:

p1First primary Partition
p2Second primary Partition
p3
Solaris Type 0xBF or 0x80 Partition
s0Slice commonly used for root
s1Slice commonly used for swap
s2Whole disk (backup or overlap slice)
s3Custom use slice
s4Custom use slice
s5Custom use slice
s6Custom use slice, commonly /export
s7Custom use slice
s8Boot block
s9Alternates (2 cylinders)
s10x86 custom use slice
s11x86 custom use slice
s12x86 custom use slice
s13x86 custom use slice
s14x86 custom use slice
s15x86 custom use slice
p4
Extended partition
p5Example: Linux or data partition
p6Example: Linux or data partition
etcExample: Linux or data partition

Note that traditionally slice 2 "overlaps" the whole disk, and is commonly referred to as the backup slice, or slightly less commonly, called the overlap slice.

The ability to have slice numbers from 8 to 15 is x86 specific. By default slice 8 covers the area on the disk where the label, vtoc and boot record is stored. Slice 9 covers the area where the "alternates" data is stored - a two-cylinder area used to record information about relocated/errored sectors.

Another example of disk device entries:

$ ls -lL /dev/dsk/c0*

brw-r-----   1 root     sys      102, 16 Jul 14 19:45 /dev/dsk/c0d0p0
brw-r-----   1 root     sys      102, 17 Jul 14 19:45 /dev/dsk/c0d0p1
brw-r-----   1 root     sys      102, 18 Jul 14 19:45 /dev/dsk/c0d0p2
brw-r-----   1 root     sys      102, 19 Jul 14 19:12 /dev/dsk/c0d0p3
brw-r-----   1 root     sys      102, 20 Jul 14 19:45 /dev/dsk/c0d0p4
brw-r-----   1 root     sys      102,  0 Jul 14 19:45 /dev/dsk/c0d0s0
brw-r-----   1 root     sys      102,  1 Jul 14 19:45 /dev/dsk/c0d0s1
...
brw-r-----   1 root     sys      102,  8 Jul 14 19:45 /dev/dsk/c0d0s8
brw-r-----   1 root     sys      102,  9 Jul 14 19:45 /dev/dsk/c0d0s9

The above example is taken form an x86 system. Note the lack of a target number in the device names. This is particular to ATA hard drives on x86 systems. Besides that it works like normal device names I described above.

Below, comparing the block and raw device entries:

$ ls -l /dev/*dsk/c1t0d0p0

lrwxrwxrwx   1 root     root          49 Jun 26 16:22 /dev/dsk/c1t0d0p0 -> ../../devices/pci@0,0/pci-ide@1f,2/ide@1/sd@0,0:q
lrwxrwxrwx   1 root     root          53 Jun  2 16:18 /dev/rdsk/c1t0d0p0 -> ../../devices/pci@0,0/pci-ide@1f,2/ide@1/sd@0,0:q,raw

These look the same, except that the second one points to the raw device node.

For completeness' sake, some utilities used in managing disks:

format The work-horse, used to perform partitioning (including fdisk partitioning on x86 based systems), analyzing/testing the disk media for defects, tuning advanced SCSI parameters, and generally checking the status and health of disks.
rmformat Shows information about removable devices, formats media, etc.
prtvtoc Command-line utility to display information about disk geometry and more importantly, the contents of the VTOC in a human readable format, showing the layout of the Solaris slices on the disk.
fmthard Write or overwrite a VTOC on a disk. Its input format is compatible with the output produced by prtvtoc, so it is possible to copy the VTOC between two disks by means of a command like this:

prtvtoc /dev/rdsk/c1t0d0s2 | fmthard -s - /dev/rdsk/c1t1d0s2

This is obviously not meaningful if the second disk do not have enough space. If the disks are of different sizes, you can use something like this:

prtvtoc /dev/rdsk/c1t0d0s2 | awk '$1 != 2' | fmthard -s - /dev/rdsk/c1t1d0s2

The above awk command will cause the entry for slice 2 to be committed, and fmthard will then maintain the existing entry or, if none exists, create a default one on the target disk.

Also note, as implied above, Solaris slices can (and often do) overlap. Care needs to be taken to not have file systems on slices which overlap other slices.

iostat -En Show "Error" information about disks, and often very usefull, the firmware revisions and manufacturer's identifier strings.
format -e This reveals [enhanced|advanced] functionality, such as the cache option on SCSI disks.
format -Mm

Enable debugging output, particularly makes SCSI probe failures non-silent.

cfgadm and luxadm also deserves honorable mention here. These commands manage disk enclosures, detaching and attaching, etc. but are also used in managing some aspects of disks.

luxadm -e port

Show list of FC HBAs.

luxadm can also for example be used to set the beacon LED on individual disks in FCAL enclosures that support this function. The details are somewhat specific to the relevant enclosure.

cfgadm can be used to probe SAN connected subsystems, eg by doing:

cfgadm -c configure c2::XXXXXXXXXXXX

(where XXXXXXXXXXX is the enclosure port WWN, using controller c2)

Hopefully this gives you an idea about how disk device names, controller names, and partitions and slices all relate to one another.

11 comments:

JMoore said...

You left out the "other" kind of slice/partitioning information that's available on Solaris 10 (both x86 and Sparc): The EFI disklabel type. http://en.wikipedia.org/wiki/GUID_Partition_Table

EFI disklabels (sometimes called GUID Partition Tables or GPT) are the new non-legacy standard put together by Intel and several other BIOS manufacturers to solve some of the PC partitioning problems.

You can slice up a disk using EFI labels with the "format -e" command. When you label the disk, it will ask if you want to use SMI labels or EFI.

These show up in /dev/dsk/ as normal slices, with the exception that s8 exists and that s2 can not be the "whole disk" partition.

EFI partitions are not allowed to overlap (which prevents a common sysadmin headache) and support larger disks more easily than the SMI labels. They're also read and handled on more platforms: Linux, *BSD, HP-UX, MacOS, and even Windows can recognize disks labeled with EFI and respect that "somebody else's" data is on that disk. (another common sysadmin headace)

--Joe

Brian Leonard said...

Great Post! I've been confused by the differences between slices and partitions (especially when a saw that a particular device could have both) and yours is the first post I've seen that clearly explains this issue. One question, how can I tell which partition contains my slices (p3 in your example)?

Hartz said...

Hi Joe, indeed I did not mention EFI labels. Mostly I know next to nothing about them! As a point in case I just burned my fingers over these - turns out if you create a Solaris Partition, and a partition with an EFI type you should expect things to not work!

Hartz said...

Hello Brian

If you have only one Solaris fdisk partition on the disk you can go into format, select the disk, then select the fdisk menu. The fdisk partition table will display and from there you can see which partition is the Solaris partition.

If you are doing some kind of hockey-pokey with hidden partitions etc to be able to use multiple Solaris partitions, then I'd say you are on your own and better keep track of what is where on the disk rather carefully.

pumpkin said...

Great post! I was wondering how ZFS fits into all this? Does it not affect this labeling at all, or do you get other device labels based on zvols (if you use them) and so on?

Thanks!

Hartz said...

Hello pumpkin, thank you for your comment - I have to blog about ZFS, but so much good documentation and so many great blog entries about it exists already that it is hard to know what to write about.

The short answer is that ZFS operates at a different layer to disk devices. Essentially it Uses disks or disk slices - either method is possible though using disks is preferred because when ZFS manages the whole disk it will turn on the device's on-disk hardware Write-caching function. It will not do this if only a slice is given to ZFS because basically write-caching is a bad thing for other file system types (corruption may occur if any slices host file systems other than ZFS on a disk with write caching enabled).

About zvols: These are "block devices" created from space allocated in a ZFS disk pool. These can be used to store other filesystems (eg a UFS file system), or for Swap devices, etc, just like a raw disk slice can be used. You are however not allowed to create a new ZFS pool using a zvol :-)

rajeshkeralam said...

Beautiful Explanation...Very useful..loved it...!!! great. thanks.

rajeshkeralam said...

Could u pls explain how to make use of s10 to s15 fdisk partitions via Solaris format utility. Is it possible ??

Thomas said...

Wow! Really informative post. Thank you very much for clarification of the partitioning and the naming scheme! I'm from the Linux side and just trying to get a peek over the fence over to Solaris based systems.

Santosh said...

on a solaris 9 machine when SAN lun is assigned, how OS will decide the controller name for the disk without using veritas/svm or hbacmd/lputil command.

Johan Hartzenberg said...

The first time a Lun is seen, the OS will assign a name, like c2t3d0. If you have installed a new controlled, it will get a new number, eg c3... It will use a number that has not yet been assigned via /etc/path_to_inst. If it is on a new initiator (eg a new storage system on a SAN) then it will assign a new "target number" such as t3. The Lun-ID determines the next part, eg Lun-1 = d1, Lun-2 = d2.

Once configured for the first time, it is registered in /etc/path_to_inst, as well as the /etc/cfg/ files (for the target)