Bug #12783

async unlinked drain races with ZFS unmount

Added by Andy Fiddaman 4 days ago. Updated about 16 hours ago.

Status: In Progress
Priority: Normal
Category: zfs - Zettabyte File System
% Done: 0%
Difficulty: Medium
Description

We've received some reports of `pkg update` failing on OmniOS because the BE cannot be cloned.

# pkg update -v --be-name=ooce-034-20200522
            Packages to update:         4
     Estimated space available:  38.05 GB
Estimated space to be consumed: 111.01 MB
       Create boot environment:       Yes
     Activate boot environment:       Yes
Create backup boot environment:        No
          Rebuild boot archive:        No

DOWNLOAD                                PKGS         FILES    XFER (MB)   SPEED
Completed                                4/4       112/112    22.0/22.0      --

pkg: Unable to clone the current boot environment.
Exit 1

These reports started after the 3rd of December but have been sporadic, and the problem has usually resolved itself before it could be properly investigated. Today, however, an OmniOS user reported a system where it was happening more often than not, which presented a chance to investigate.

Here's the update again with BE_PRINT_ERR=true in the environment:

DOWNLOAD                                PKGS         FILES    XFER (MB)   SPEED
Completed                                4/4       112/112    22.0/22.0      --

be_unmount_root: failed to unmount BE root dataset rpool/ROOT/ooce-034-20200505: Device busy
be_copy: failed to unmount newly created BE
be_copy: destroying partially created boot environment
pkg: Unable to clone the current boot environment.

To make this simpler to reproduce, I wrote a small test program:

#include <stdio.h>
#include <stdlib.h>     /* putenv() */
#include <libbe.h>

int
main(void)
{
        int ret;

        putenv("BE_PRINT_ERR=true");

        for (;;) {
                nvlist_t *cattrs, *dattrs;
                char *p = NULL;

                nvlist_alloc(&cattrs, NV_UNIQUE_NAME, 0);
                nvlist_alloc(&dattrs, NV_UNIQUE_NAME, 0);

                printf("Copying BE\n");
                ret = be_copy(cattrs);

                if (ret != BE_SUCCESS) {
                        fprintf(stderr, "- Copy failed: %d\n", ret);
                        break;
                }

                if (nvlist_lookup_string(cattrs,
                    BE_ATTR_NEW_BE_NAME, &p) != 0) {
                        fprintf(stderr, "- No new BE name\n");
                        break;
                }

                printf("- Got new BE name: %s\n", p);

                nvlist_add_string(dattrs, BE_ATTR_ORIG_BE_NAME, p);

                printf("Destroying BE\n");
                ret = be_destroy(dattrs);

                if (ret != BE_SUCCESS) {
                        fprintf(stderr, "- Destroy failed: %d\n", ret);
                        break;
                }

                nvlist_free(cattrs);
                nvlist_free(dattrs);
        }

        return 0;
}

Running this on the affected system always replicates the problem within three iterations.

Using DTrace, I added a small (500ms) delay at the start of zfs_umount(), and the problem went away.

Looking at the code, zfs_umount() returns EBUSY if there are active vnodes in the filesystem.
I used the following DTrace script to monitor calls to zfs_zget() against the filesystem while zfs_umount() is running:

#!/usr/sbin/dtrace -Cs

vfs_t *v;

fbt::zfs_umount:entry
{
        v = (vfs_t *)arg0;
}

fbt::zfs_umount:return
{
        v = 0;
}

fbt::zfs_zget:entry
/((zfsvfs_t *)arg0)->z_vfs == v/
{
        @[stack()] = count();
}

and that showed the problem:

root@nfs:/home/andyf# ./d -c ./beclone
dtrace: script './d' matched 3 probes
Copying BE
- Got new BE name: ooce-034-20200505
Destroying BE
dtrace: pid 9109 has exited

              zfs`zfs_unlinked_drain_task+0x73
              genunix`taskq_d_thread+0xbc
              unix`thread_start+0xb
             2390

root@nfs:/home/andyf# ./d -c ./beclone
dtrace: script './d' matched 3 probes
Copying BE
be_unmount_root: failed to unmount BE root dataset rpool/ROOT/ooce-034-20200505: Device busy
be_copy: failed to unmount newly created BE
be_copy: destroying partially created boot environment
- Copy failed: 4040
dtrace: pid 9106 has exited

              zfs`zfs_unlinked_drain_task+0x73
              genunix`taskq_d_thread+0xbc
              unix`thread_start+0xb
             2660

The asynchronous unlinked drain task is busily grabbing and releasing vnodes during the unmount.
This races with the check for active vnodes and can cause zfs_umount() to fail with EBUSY.

Asynchronous unlinked drain came in with 12002 on Tue Nov 26 14:27:07 2019 -0700, which ties in with the date of the first report.


Related issues

Related to illumos gate - Bug #12002: async unlinked drain (Closed)

History

#1

Updated by Andy Fiddaman 4 days ago

  • Related to Bug #12002: async unlinked drain added
#2

Updated by Andy Fiddaman 4 days ago

To test this, I used the following script to create a ZFS filesystem which has 10000 entries in the unlink list:

#!/bin/ksh

testfs=rpool/test
mp=/rpool/test

num=10000

mdb -kwe 'zfs_unlink_suspend_progress/z0'

zfs destroy $testfs
set -e
zfs create -o mountpoint=$mp $testfs

for f in `seq 1 $num`; do
        mkfile 128k $mp/file-$f
done

mdb -kwe 'zfs_unlink_suspend_progress/z1'

for f in `seq 1 $num`; do
        exec 6<$mp/file-$f
        rm $mp/file-$f
done

Then I used this D script to monitor the relevant functions and to widen the race window a bit:

#!/usr/sbin/dtrace -CFws

fbt:zfs:zfs_mount: {}
fbt:zfs:zfs_umount:entry {
        v = (vfs_t *)arg0;
        vv = (zfsvfs_t *)v->vfs_data;
        printf("Draining %d", vv->z_draining);
}
fbt:zfs:zfs_umount:return {}
fbt:zfs:zfsvfs_teardown: {}
fbt:zfs:zfs_unlinked_drain: {}
fbt:zfs:zfs_unlinked_drain_stop_wait: {}

fbt:zfs:zfs_unlinked_drain_task:entry {
        self->t++;
        self->v = 0;
        printf("Task started");
}
fbt:zfs:zfs_unlinked_drain_task:return {
        self->t--;
        printf("Drained %d nodes", self->v);
}
fbt:zfs:zfs_zget:entry/self->t/ {
        chill(500);
        self->v++;
}
fbt:zfs:zfs_zget:return/self->t/ {}

Unmounting the filesystem works and shows the expected code path since the unlinked list drain task is not running:

CPU FUNCTION
  3  -> zfs_umount                            Draining 0
  3    -> zfsvfs_teardown
  3      -> zfs_unlinked_drain_stop_wait
  3      <- zfs_unlinked_drain_stop_wait
  3    <- zfsvfs_teardown
  3  <- zfs_umount

Mounting again:

 12  -> zfs_mount
 12    -> zfs_unlinked_drain
 12    <- zfs_unlinked_drain
  4    -> zfs_unlinked_drain_task             Task started
 12    <- zfs_mount
  4  <- zfs_unlinked_drain_task               Drained 10000 nodes

Note that zfs_unlink_suspend_progress is still set, so the unlinked list is still in place for the next mount even though it has been scanned.

Now quickly mount and unmount to replicate the original problem case:

# zfs mount rpool/test; zfs umount rpool/test

 13  -> zfs_mount
 13    -> zfs_unlinked_drain
 13    <- zfs_unlinked_drain
  3    -> zfs_unlinked_drain_task             Task started
 13    <- zfs_mount
  9  -> zfs_umount                            Draining 1
  9  -> zfs_unlinked_drain_stop_wait
  3  <- zfs_unlinked_drain_task               Drained 364 nodes
  9  <- zfs_unlinked_drain_stop_wait
  9  -> zfsvfs_teardown
  9    -> zfs_unlinked_drain_stop_wait
  9    <- zfs_unlinked_drain_stop_wait
  9  <- zfsvfs_teardown
  9  <- zfs_umount

In this case, the list was still draining when zfs_umount() was called, so the task was stopped before the unmount proceeded.

Now try this again but have a file open on the filesystem before unmounting:

# zfs mount rpool/test; cd /rpool/test; zfs umount rpool/test
cannot unmount '/rpool/test': Device busy

  3  -> zfs_mount
  3    -> zfs_unlinked_drain
  3    <- zfs_unlinked_drain
 11    -> zfs_unlinked_drain_task             Task started
  3    <- zfs_mount
  6  -> zfs_umount                            Draining 1
  6  -> zfs_unlinked_drain_stop_wait
 11  <- zfs_unlinked_drain_task               Drained 417 nodes
  6  <- zfs_unlinked_drain_stop_wait
  6  -> zfs_unlinked_drain
  6  <- zfs_unlinked_drain
  6  <- zfs_umount
 14  -> zfs_unlinked_drain_task               Task started
 14  <- zfs_unlinked_drain_task               Drained 10000 nodes

This shows the task being stopped for the active vnode check, and then restarted when the unmount fails.
The second task run reports 10000 nodes because the zfs_unlink_suspend_progress tunable is still set, so entries are never actually removed from the list. Let's stop suspending progress:

# mdb -kwe 'zfs_unlink_suspend_progress/z0'
zfs_unlink_suspend_progress:    0x1             =       0x0
# zfs mount rpool/test; cd /rpool/test; zfs umount rpool/test
/rpool/test
cannot unmount '/rpool/test': Device busy

 15  -> zfs_mount
 15    -> zfs_unlinked_drain
 15    <- zfs_unlinked_drain
  2    -> zfs_unlinked_drain_task             Task started
 15    <- zfs_mount
 13  -> zfs_umount                            Draining 1
 13  -> zfs_unlinked_drain_stop_wait
  2  <- zfs_unlinked_drain_task               Drained 146 nodes
 13  <- zfs_unlinked_drain_stop_wait
 13  -> zfs_unlinked_drain
 13  <- zfs_unlinked_drain
 13  <- zfs_umount
  6  -> zfs_unlinked_drain_task               Task started
 11  <- zfs_unlinked_drain_task               Drained 9854 nodes

For a final test, reset the filesystem, un-suspend processing, and then run mount/umount in a loop until the queue is drained; this should happen in chunks.

# while :; do
zfs mount rpool/test; zfs umount rpool/test
done

  5  <- zfs_unlinked_drain_task               Drained 165 nodes
 15  <- zfs_unlinked_drain_task               Drained 152 nodes
 10  <- zfs_unlinked_drain_task               Drained 140 nodes
 13  <- zfs_unlinked_drain_task               Drained 163 nodes
 11  <- zfs_unlinked_drain_task               Drained 161 nodes
 14  <- zfs_unlinked_drain_task               Drained 160 nodes
 14  <- zfs_unlinked_drain_task               Drained 166 nodes
  9  <- zfs_unlinked_drain_task               Drained 161 nodes
  8  <- zfs_unlinked_drain_task               Drained 145 nodes
 14  <- zfs_unlinked_drain_task               Drained 164 nodes
  7  <- zfs_unlinked_drain_task               Drained 144 nodes
  5  <- zfs_unlinked_drain_task               Drained 155 nodes
  1  <- zfs_unlinked_drain_task               Drained 166 nodes
  4  <- zfs_unlinked_drain_task               Drained 154 nodes
  1  <- zfs_unlinked_drain_task               Drained 167 nodes
  2  <- zfs_unlinked_drain_task               Drained 153 nodes
  3  <- zfs_unlinked_drain_task               Drained 145 nodes
  1  <- zfs_unlinked_drain_task               Drained 170 nodes
  4  <- zfs_unlinked_drain_task               Drained 171 nodes
  1  <- zfs_unlinked_drain_task               Drained 153 nodes
  5  <- zfs_unlinked_drain_task               Drained 159 nodes
 15  <- zfs_unlinked_drain_task               Drained 158 nodes
  1  <- zfs_unlinked_drain_task               Drained 168 nodes
  1  <- zfs_unlinked_drain_task               Drained 171 nodes
  1  <- zfs_unlinked_drain_task               Drained 160 nodes
  2  <- zfs_unlinked_drain_task               Drained 144 nodes
  6  <- zfs_unlinked_drain_task               Drained 152 nodes
  0  <- zfs_unlinked_drain_task               Drained 159 nodes
  2  <- zfs_unlinked_drain_task               Drained 149 nodes
  0  <- zfs_unlinked_drain_task               Drained 171 nodes
  3  <- zfs_unlinked_drain_task               Drained 138 nodes
  1  <- zfs_unlinked_drain_task               Drained 217 nodes
  4  <- zfs_unlinked_drain_task               Drained 165 nodes
  1  <- zfs_unlinked_drain_task               Drained 152 nodes
  7  <- zfs_unlinked_drain_task               Drained 159 nodes
  2  <- zfs_unlinked_drain_task               Drained 143 nodes
  1  <- zfs_unlinked_drain_task               Drained 156 nodes
  4  <- zfs_unlinked_drain_task               Drained 151 nodes
  6  <- zfs_unlinked_drain_task               Drained 150 nodes
  2  <- zfs_unlinked_drain_task               Drained 170 nodes
  2  <- zfs_unlinked_drain_task               Drained 158 nodes
  0  <- zfs_unlinked_drain_task               Drained 171 nodes
  2  <- zfs_unlinked_drain_task               Drained 144 nodes
  4  <- zfs_unlinked_drain_task               Drained 187 nodes
  3  <- zfs_unlinked_drain_task               Drained 140 nodes
  1  <- zfs_unlinked_drain_task               Drained 176 nodes
  3  <- zfs_unlinked_drain_task               Drained 162 nodes
  1  <- zfs_unlinked_drain_task               Drained 170 nodes
  4  <- zfs_unlinked_drain_task               Drained 162 nodes
  1  <- zfs_unlinked_drain_task               Drained 165 nodes
  4  <- zfs_unlinked_drain_task               Drained 163 nodes
  1  <- zfs_unlinked_drain_task               Drained 172 nodes
  0  <- zfs_unlinked_drain_task               Drained 162 nodes
  1  <- zfs_unlinked_drain_task               Drained 169 nodes
  1  <- zfs_unlinked_drain_task               Drained 173 nodes
  6  <- zfs_unlinked_drain_task               Drained 178 nodes
  5  <- zfs_unlinked_drain_task               Drained 160 nodes
  5  <- zfs_unlinked_drain_task               Drained 158 nodes
  1  <- zfs_unlinked_drain_task               Drained 156 nodes
  1  <- zfs_unlinked_drain_task               Drained 161 nodes
  3  <- zfs_unlinked_drain_task               Drained 152 nodes
  1  <- zfs_unlinked_drain_task               Drained 157 nodes
  0  <- zfs_unlinked_drain_task               Drained 57 nodes
  2  <- zfs_unlinked_drain_task               Drained 0 nodes
  1  <- zfs_unlinked_drain_task               Drained 0 nodes
  7  <- zfs_unlinked_drain_task               Drained 0 nodes
  2  <- zfs_unlinked_drain_task               Drained 0 nodes
  0  <- zfs_unlinked_drain_task               Drained 0 nodes
  1  <- zfs_unlinked_drain_task               Drained 0 nodes
  3  <- zfs_unlinked_drain_task               Drained 0 nodes
  0  <- zfs_unlinked_drain_task               Drained 0 nodes
  0  <- zfs_unlinked_drain_task               Drained 0 nodes
  3  <- zfs_unlinked_drain_task               Drained 0 nodes
  1  <- zfs_unlinked_drain_task               Drained 0 nodes
  0  <- zfs_unlinked_drain_task               Drained 0 nodes
  0  <- zfs_unlinked_drain_task               Drained 0 nodes
  1  <- zfs_unlinked_drain_task               Drained 0 nodes
  1  <- zfs_unlinked_drain_task               Drained 0 nodes
  7  <- zfs_unlinked_drain_task               Drained 0 nodes
  3  <- zfs_unlinked_drain_task               Drained 0 nodes
  1  <- zfs_unlinked_drain_task               Drained 0 nodes
  0  <- zfs_unlinked_drain_task               Drained 0 nodes
  4  <- zfs_unlinked_drain_task               Drained 0 nodes
  2  <- zfs_unlinked_drain_task               Drained 0 nodes
  0  <- zfs_unlinked_drain_task               Drained 0 nodes
  1  <- zfs_unlinked_drain_task               Drained 0 nodes
  1  <- zfs_unlinked_drain_task               Drained 0 nodes
  0  <- zfs_unlinked_drain_task               Drained 0 nodes
#3

Updated by Andy Fiddaman about 16 hours ago

  • Gerrit CR set to 694