Bug #15688


blkdev and lofi need to properly init cmlb minors

Added by Robert Mustacchi 11 days ago. Updated 11 days ago.

driver - device drivers
Start date:
Due date:
% Done:


Estimated time:
Gerrit CR:
External Bug:


A colleague discovered the following initial issue:

As part of installinator testing, a Gimlet boots and then uses devinfo to get a list of disks in the system. No ZFS pools are imported except for the boot time ramdisk pool. We construct a /devices path to the wd,raw minor for the disk we have seen in the devinfo tree, but when we go to open it, it isn't there! Other nodes from the disk are present, just not wd or wd,raw.

The disk in question is one of the M.2/BSU disks, which at the time we're looking at it has the following properties:

  • it is less than 2TiB in capacity
  • it has previously, in another boot with another image on the manufacturing station, where at least these things happened:
    • formatted with nvmeadm to use 4K sectors
    • had a particular EFI/GPT label created with our expected layout
    • had an image written to s0

So, critically, the disk is already labelled, but when we go to look at it, it hasn't been used for anything so far on this boot.

It turns out this condition is reproducible, if you:

# Attempt to convince things to detach if not unload:
for (( i = 0; i < 10; i++ )); do
        modunload -i 0

# Look for the disk device:
ls -l /devices/pci@0,0/pci1022,1483@1,3/pci1344,3100@0/

What we saw each time we did this was that all of the nodes we expected were there (a and a,raw, through u and u,raw). In addition, we saw a node that should not be there for an EFI label: h and h,raw. We did not see wd or wd,raw.

You can make the correct nodes show up by running diskinfo at least once between modunload and ls. We did not need to inject a sleep, the natural delay was enough for it to show up by the time we looked.

Running diskinfo much later than attach time also makes the node show up at that time. Though the /devices file system only seems to have second-granularity timestamps, if you wait a few seconds between attach and diskinfo you can definitely see that the wd* nodes are created later than all the others.

I haven't nailed this down, but a quick look in cmlb_create_minor_nodes() suggests that there is a logic bug here. There are three properties on the struct cmlb_lun that track a label type:

  • cl_def_labeltype, the "default" label type, set in cmlb_attach() to CMLB_LABEL_VTOC iff the device capacity is less than 2TiB (which it is for the M.2, but not for the U.2 devices!)
  • cl_cur_labeltype, the current label type, set during attach to CMLB_LABEL_UNDEF
  • cl_last_labeltype the previous label type, set during attach to CMLB_LABEL_UNDEF, and then set at the end of cmlb_create_minor_nodes()

cmlb_create_minor_nodes() is run at various points: during attach (which is what gets driven by our walk of the devinfo tree), but also later, during label validation and presumably other media-inspection operations (that diskinfo performs, for instance). There is logic in the routine that tries to do things on the "first time", and also tries to do things when a device is switched between label types. Critically, the h,raw minor appears for VTOC labels and disappears for EFI labels (the inverse of the wd,raw minor), and this disk definitely has an EFI label.

I expect this just needs to be a bit more explicit in its tracking of this state, especially from a cold start.

From here, I took over and found the following:

It was subsequently reported that the workaround of simply running diskinfo was insufficient. Here's a bit of background. The cmlb module will go through and create minor nodes at a few different points in time:

  • When cmlb_attach() is called, which is used by disk providers such as sd, blkdev, etc.
  • When cmlb_validate() is called, which is done at various points in time by drivers such as attach, open, etc.
  • When cmlb_close() is called on removable media, this is used to reset the state of the world.

There are basically three possible states that a device can be in for labels internally to the driver, which is kept in the cl_cur_labeltype member.

  • CMLB_LABEL_EFI indicating that there is an EFI lablel.
  • CMLB_LABEL_VTOC indicating that there is a VTOC label.
  • CMLB_LABEL_UNDEF indicating that the label is not currently known. When the label is not known, then the driver will create minor nodes based on the underlying capacity of the device, basically whether or not the disk exceeds 2 TiB worth of blocks.

The important piece here is now when do we actually update the value of cl_cur_labeltype. The answer to that is that we only do so after a call to cmlb_validate(). So this means that the behavior we're seeing is because we haven't actually called cmlb_validate() on this disk ever. In this case, blkdev only ever calls cmlb_validate() in response to a call to open. If we look at other drivers, this and lofi are an exception. In particular, sd, xdf, cmdk, and dada all actually always call the validate entry point in attach. A large number of these even just ignore the return value, but rather are calling this (whether intentionally or not),

I believe that what's happened here is that cmlb is expecting there to be actions taken between the time the handle is attached and when things are ready for action, which is why it doesn't call this directly itself. Of these drivers, it's only the two newer ones that are missing this, blkdev and lofi (this was added with labeled lofi support in #6602. The fix here is straightforward. We need to place a call to cmlb_validate() while in the blkdev attach routine and the lofi driver's lofi_create_minor_nodes() routine which is where it creates the cmlb handle.

To help verify this, I tried to look at what was actually making calls to bd_open and while most other devices had items found or zpool did enough work to find everything and make it work, nothing would have ever touched this particular M.2. The reason only one of the two M.2s was touched on my system was because pilot was being used to set up a dump device.

With the proposed fix in place, which I think makes sense, we were able to find get this working and on a fresh boot I saw the wd,raw devices and related. Prior to this I always saw the original pathology.

While testing this with lofi, I noticed incidentally that lofi didn't default to have a set of EFI labels when I created a volume that was larger than 2T (I used mkfile to create a sparse file for this). However, the naive lofi fix is not sufficient here. This is because of how lofi designs itself and that it doesn't want to perform I/O at this initial time. Because of this. labeled lofi also doesn't actually have the right default label set and will always be VTOC because we can't get the right size.

From using dtrace it appears that this was working a bit by fluke as we got a call to cmlb_validate() via a syseventd plugin, which isn't something we want to rely upon. Absent a fix for the above (which we are not doing at this time), it appears that this isn't strictly changing things, but will be helpful to be correct regardless.

To test this I manually verified on an Oxide platform that we had the right minors being created for blkdev devices. Incidentally I also noticed on debug builds we didn't have disappearing M.2 devices any more. I also on a stock i86pc system manually tested a labeled lofi device with both a 10g and something closer to 3000g configurations to ensure that we tested both label cases. I also used lofiadm -l and manually labeled a device and verified that the right minors were created.

Actions #1

Updated by Electric Monk 11 days ago

  • Gerrit CR set to 2877

Also available in: Atom PDF