Bug #13194


null/dangling pointer deref somewhere under dmu_objset_upgrade

Added by Alex Wilson about 3 years ago. Updated almost 3 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:
External Bug:

Description

After the integration of #11479 zfs project support, we've had machines hit panics shortly after upgrade (and re-hit similar panics after rebooting). Some machines hit it fairly reliably early on in startup, while others hit it much later while recv'ing zfs send streams.

The panic varies, but often looks like:

> ::status
debugging crash dump vmcore.0 (64-bit) from 24-6e-96-31-df-4c
operating system: 5.11 joyent_20190919T022916Z (i86pc)
git branch: master
git rev: 945e7932e079324beb3dba245ce6bc8f401695da
image uuid: (not set)
panic message: BAD TRAP: type=e (#pf Page fault) rp=fffffbe364161a20 addr=0 occurred in module "<unknown>" due to a NULL pointer dereference
dump content: kernel pages only

> $C
fffffbe364161c00 taskq_thread+0x2d0(fffffdf4c59fd198)
fffffbe364161c10 thread_start+0xb()

Sometimes stacks appear that involve dmu_objset_upgrade_task_cb as well. The crash dump following the panic usually succeeds.

We haven't been able to reproduce this on any non-production systems, which makes debugging more challenging. We have also observed permanent data corruption in ZFS pools following this panic on two machines, which makes us very loath to reproduce it further in production. I'll add my initial analysis of one of the dumps as a comment.

When we neuter the upgrade process introduced for #11479 by adding return; as the first line of dmu_objset_upgrade, these panics do not occur and our systems run normally. We have patched all of our production illumos systems this way since late last year to avoid this panic.
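
For reference, the stop-gap looks roughly like this. It is only a sketch: the prototype is paraphrased from dmu_objset.c and may not match the gate exactly; the point is simply that the function returns before dispatching dmu_objset_upgrade_task_cb onto the z_upgrade taskq.

/*
 * Stop-gap only, not the upstream fix: bail out of dmu_objset_upgrade()
 * immediately so that no upgrade task is ever dispatched.
 */
static void
dmu_objset_upgrade(objset_t *os, dmu_objset_upgrade_cb_t cb)
{
	return;		/* neuter the async objset upgrade path */
	/* the original body (which dispatches dmu_objset_upgrade_task_cb) never runs */
}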


Related issues

Related to illumos gate - Bug #11479: zfs project support (Closed, Jerry Jelinek)
Related to illumos gate - Bug #13358: dmu_objset_upgrade_stop() needs to wait (Closed, Jason King)
#1

Updated by Alex Wilson about 3 years ago

[root@24-6e-96-31-df-4c (gps-1) /var/crash/volatile]# mdb {unix,vmcore}.0
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs mm sd lofs idm mpt_sas sata crypto random cpc logindmux ptm sppp nsmb smbsrv nfs ]

> ::status
debugging crash dump vmcore.0 (64-bit) from 24-6e-96-31-df-4c
operating system: 5.11 joyent_20190919T022916Z (i86pc)
git branch: master
git rev: 945e7932e079324beb3dba245ce6bc8f401695da
image uuid: (not set)
panic message: BAD TRAP: type=e (#pf Page fault) rp=fffffbe364161a20 addr=0 occurred in module "<unknown>" due to a NULL pointer dereference
dump content: kernel pages only

> $C
fffffbe364161c00 taskq_thread+0x2d0(fffffdf4c59fd198)
fffffbe364161c10 thread_start+0xb()

(we jumped to NULL, so the topmost stack frame is bogus and doesn't show up here)

> ::regs
%rax = 0x0000000000000000                 %r9  = 0xfffffbe36433cc20
%rbx = 0xfffffdf4c28e49c0                 %r10 = 0x0000000000000000
%rcx = 0x0000000000000003                 %r11 = 0x0000000000000000
%rdx = 0xfffffbe364161c20                 %r12 = 0xfffffdf4c28e4c68
%rsi = 0xfffffdf4befadf18                 %r13 = 0x0000000000000001
%rdi = 0xfffffdf4c28e49c0                 %r14 = 0xfffffdf4c59fd1c0
%r8  = 0xfffffdf45d918000                 %r15 = 0xfffffbe364161ba0

%rip = 0x0000000000000000
%rbp = 0xfffffbe364161b40
%rsp = 0xfffffbe364161b18
%rflags = 0x00010246
  id=0 vip=0 vif=0 ac=0 vm=0 rf=1 nt=0 iopl=0x0
  status=<of,df,IF,tf,sf,ZF,af,PF,cf>

%cs = 0x0030    %ds = 0x004b    %es = 0x004b    %fs = 0x0000
%gs = 0x01c3    %gsbase = 0xfffffdf45ceb6000    %kgsbase = 0x0
%trapno = 0xe   %err = 0x10     %cr2 = 0x0      %cr3 = 0x1f800000

dump the stack out a bit:

> 0xfffffbe364161b18,10/nap
0xfffffbe364161b18:
0xfffffbe364161b18:             dmu_objset_upgrade_task_cb+0x4d     # so we probably got into this function
0xfffffbe364161b20:             0xfffffdf4c59fd198
0xfffffbe364161b28:             0xfffffdf4c59fd1b8
0xfffffbe364161b30:             0xfffffbe364161b40
0xfffffbe364161b38:             0xfffffdf4c28e49c0
0xfffffbe364161b40:             0xfffffbe364161c00
0xfffffbe364161b48:             taskq_thread+0x2d0
0xfffffbe364161b50:             0xbaddcafebaddcafe
0xfffffbe364161b58:             0xfffffdf4befadf18                  # this is probably the taskq_ent_t
0xfffffbe364161b60:             0x1408b1cace2585
0xfffffbe364161b68:             0xfffffdf4c59fd1ca
0xfffffbe364161b70:             0xfffffdf4c59fd1c8
0xfffffbe364161b78:             0xfffffdf4c59fd1cc
0xfffffbe364161b80:             1
0xfffffbe364161b88:             0xfffffdf4c59fd208
0xfffffbe364161b90:             0x15d419000

> 0xfffffdf4befadf18::print -t taskq_ent_t
taskq_ent_t {
    struct taskq_ent *tqent_next = 0xfffffdf4bea8e598
    struct taskq_ent *tqent_prev = 0xfffffdf4c59fd208
    task_func_t *tqent_func = dmu_objset_upgrade_task_cb
    void *tqent_arg = 0xfffffdf4c28e49c0
    union  tqent_un = {
        taskq_bucket_t *tqent_bucket = 0
        uintptr_t tqent_flags = 0
    }
    kthread_t *tqent_thread = 0
    kcondvar_t tqent_cv = {
        ushort_t _opaque = 0
    }
}

> dmu_objset_upgrade_task_cb+0x4d::dis
dmu_objset_upgrade_task_cb+0x1e:movq   %r12,%rdi
dmu_objset_upgrade_task_cb+0x21:call   +0x3a43a0a       <mutex_enter>
dmu_objset_upgrade_task_cb+0x26:movl   0x2c0(%rbx),%r10d
dmu_objset_upgrade_task_cb+0x2d:movl   $0x4,0x2c4(%rbx)
dmu_objset_upgrade_task_cb+0x37:testl  %r10d,%r10d
dmu_objset_upgrade_task_cb+0x3a:jne    +0x1f    <dmu_objset_upgrade_task_cb+0x5b>
dmu_objset_upgrade_task_cb+0x3c:movq   %r12,%rdi
dmu_objset_upgrade_task_cb+0x3f:call   +0x3a43b0c       <mutex_exit>
dmu_objset_upgrade_task_cb+0x44:movq   %rbx,%rdi
dmu_objset_upgrade_task_cb+0x47:call   *0x2b8(%rbx)                        # looks like this is the deref that sent us to pc=0
dmu_objset_upgrade_task_cb+0x4d:movq   %r12,%rdi
dmu_objset_upgrade_task_cb+0x50:movl   %eax,0x2c4(%rbx)
dmu_objset_upgrade_task_cb+0x56:call   +0x3a439d5       <mutex_enter>
dmu_objset_upgrade_task_cb+0x5b:movl   $0x1,0x2c0(%rbx)
dmu_objset_upgrade_task_cb+0x65:movq   $0x0,0x2b0(%rbx)
dmu_objset_upgrade_task_cb+0x70:movq   %r12,%rdi
dmu_objset_upgrade_task_cb+0x73:call   +0x3a43ad8       <mutex_exit>
dmu_objset_upgrade_task_cb+0x78:movq   -0x20(%rbp),%rbx
dmu_objset_upgrade_task_cb+0x7c:movq   -0x18(%rbp),%r12
dmu_objset_upgrade_task_cb+0x80:leave
dmu_objset_upgrade_task_cb+0x81:ret

> ::offsetof objset_t os_upgrade_cb
offsetof (objset_t, os_upgrade_cb) = 0x2b8, sizeof (...->os_upgrade_cb) = 8

> 0xfffffdf4c28e49c0::print objset_t os_upgrade_cb
os_upgrade_cb = 0
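
Reading the disassembly back into C (this is a reconstruction from the instruction stream above, not a copy of the source; the struct offsets other than 0x2b8 are inferred from the objset_t layout), the faulting sequence is roughly:

static void
dmu_objset_upgrade_task_cb(void *data)
{
	objset_t *os = data;

	mutex_enter(&os->os_upgrade_lock);
	os->os_upgrade_status = EINTR;		/* movl $0x4,0x2c4(%rbx) */
	if (!os->os_upgrade_exit) {		/* movl 0x2c0(%rbx),%r10d; testl */
		mutex_exit(&os->os_upgrade_lock);
		/*
		 * call *0x2b8(%rbx): the indirect call through os_upgrade_cb.
		 * os_upgrade_cb is 0 in this dump, so this is the jump to pc=0.
		 */
		os->os_upgrade_status = os->os_upgrade_cb(os);
		mutex_enter(&os->os_upgrade_lock);
	}
	os->os_upgrade_exit = B_TRUE;		/* movl $0x1,0x2c0(%rbx) */
	os->os_upgrade_id = 0;			/* movq $0x0,0x2b0(%rbx) */
	mutex_exit(&os->os_upgrade_lock);
}

The return address pushed by that indirect call is dmu_objset_upgrade_task_cb+0x4d, which is exactly the value sitting at the top of the raw stack dump above.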

the rest of this objset_t looks bonkers too:

objset_t {
    struct dsl_dataset *os_dsl_dataset = 0xd000d000d000d
    spa_t *os_spa = 0xd000d000d000d
    arc_buf_t *os_phys_buf = 0xd000d000d000d
    objset_phys_t *os_phys = 0xd000d000d000d
    boolean_t os_encrypted = 0xd000d (B_TRUE|0xd000c)
    dnode_handle_t os_meta_dnode = {
        zrlock_t dnh_zrlock = {
            kmutex_t zr_mtx = {
                void *[1] _opaque = [ 0xd000d000d000d ]
            }
            volatile int32_t zr_refcount = 0xd000d
            kcondvar_t zr_cv = {
                ushort_t _opaque = 0xd
            }
            uint16_t zr_pad = 0xd
        }
        dnode_t *dnh_dnode = 0xd000d000d000d
    }
    dnode_handle_t os_userused_dnode = {
        zrlock_t dnh_zrlock = {
            kmutex_t zr_mtx = {
                void *[1] _opaque = [ 0xd000d000d000d ]
            }
            volatile int32_t zr_refcount = 0xd000d
            kcondvar_t zr_cv = {
                ushort_t _opaque = 0xd
            }
            uint16_t zr_pad = 0xd
        }
        dnode_t *dnh_dnode = 0xd000d000d000d
    }
    dnode_handle_t os_groupused_dnode = {
        zrlock_t dnh_zrlock = {
            kmutex_t zr_mtx = {
                void *[1] _opaque = [ 0xd000d000d000d ]
            }
            volatile int32_t zr_refcount = 0xd000d
            kcondvar_t zr_cv = {
                ushort_t _opaque = 0xd
            }
            uint16_t zr_pad = 0xd
        }
        dnode_t *dnh_dnode = 0xd000d000d000d
    }
    dnode_handle_t os_projectused_dnode = {
        zrlock_t dnh_zrlock = {
            kmutex_t zr_mtx = {
                void *[1] _opaque = [ 0xd000d000d000d ]
            }
            volatile int32_t zr_refcount = 0xd000d
            kcondvar_t zr_cv = {
                ushort_t _opaque = 0xd
            }
            uint16_t zr_pad = 0xd
        }
        dnode_t *dnh_dnode = 0xd000d000d000d
    }
    zilog_t *os_zil = 0xd000d000d000d
    list_node_t os_evicting_node = {
        struct list_node *list_next = 0xd000d000d000d
...
        uint64_t zh_flags = 0xd000d000d000d
        uint64_t zh_claim_lr_seq = 0xd000d000d000d
        uint64_t [3] zh_pad = [ 0xd000d000d000d, 0xd000d000d000d, 0xd000d000d000d ]
    }
    multilist_t *os_synced_dnodes = 0xd000d000d000d
    uint64_t os_flags = 0xd000d000d000d
    uint64_t os_freed_dnodes = 0xd000d000d000d
    boolean_t os_rescan_dnodes = 0xd000d (B_TRUE|0xd000c)
    boolean_t os_raw_receive = 0xd000d (B_TRUE|0xd000c)
    boolean_t [4] os_next_write_raw = [ 0xd000d (B_TRUE|0xd000c), 0xd000d (B_TRUE|0xd000c), 0xd000d (B_TRUE|0xd000c), 0xd000d (B_TRUE|0xd000c) ]
    kmutex_t os_obj_lock = {
        void *[1] _opaque = [ 0xd000d000d000d ]
    }
    uint64_t os_obj_next_chunk = 0xd000d000d000d
    uint64_t *os_obj_next_percpu = 0xd000d000d000d
    int os_obj_next_percpu_len = 0xd000d
    kmutex_t os_lock = {
        void *[1] _opaque = [ 0xd000d000d000d ]
    }
    multilist_t *[4] os_dirty_dnodes = [ 0xd000d000d000d, 0xd000d000d000d, 0xd000d000d000d, 0xd000d000d000d ]
    list_t os_dnodes = {
        size_t list_size = 0xd000d000d000d
        size_t list_offset = 0xd000d000d000d
        struct list_node list_head = {
            struct list_node *list_next = 0xd000d000d000d
            struct list_node *list_prev = 0xd000d000d000d
        }
    }
    list_t os_downgraded_dbufs = {
        size_t list_size = 0xd000d000d000d
        size_t list_offset = 0xd000d000d000d
        struct list_node list_head = {
            struct list_node *list_next = 0xd000d000d000d
            struct list_node *list_prev = 0xd000d000d000d
        }
    }
    kmutex_t os_userused_lock = {
        void *[1] _opaque = [ 0xd ]
    }
    kmutex_t os_user_ptr_lock = {
        void *[1] _opaque = [ 0 ]
    }
    void *os_user_ptr = 0
    sa_os_t *os_sa = 0x50
    kmutex_t os_upgrade_lock = {
        void *[1] _opaque = [ 0 ]
    }
    taskqid_t os_upgrade_id = 0
    dmu_objset_upgrade_cb_t os_upgrade_cb = 0
    boolean_t os_upgrade_exit = 0 (0)
    int os_upgrade_status = 0x4
}
#2

Updated by Alex Wilson about 3 years ago

Some further observations:

  • This seems to occur most readily on machines doing lots of metadata operations (such as creating and destroying snapshots or datasets)
  • The path that sends us here (dmu_objset_upgrade) is called during mount, so datasets that are set not to auto-mount don't seem to trigger it
  • We've had this occur on both SmartOS and OmniOS(ce)
  • We've observed this periodically ever since #11479 was integrated; this is not a recently introduced issue. We patched all of our production systems early this year to neuter the dataset upgrade code, and since then have only seen it when machines were accidentally booted to an unpatched kernel.
  • Timing, stack contents, and specifics are quite variable, so I suspect the underlying bug might be a race or have a strong timing component
  • We have 3 different dumps of this issue available, but they all contain potentially sensitive data from production systems, which makes it difficult to share them.
#3

Updated by Alex Wilson about 3 years ago

An example of one of the machines with permanent data corruption after hitting this:

[root@90-e2-ba-d3-ec-a0 (gps-1) ~]# zpool status -v
  pool: zones
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0 days 00:32:47 with 1 errors on Tue Jul 21 08:36:53 2020
config:

        NAME                       STATE     READ WRITE CKSUM
        zones                      ONLINE       0     0     0
          raidz1-0                 ONLINE       0     0     0
            c1t5000C500862213D5d0  ONLINE       0     0     0
            c2t5000C500862206BDd0  ONLINE       0     0     0
            c3t5000C50085402429d0  ONLINE       0     0     0
            c4t5000C50085CEBDE5d0  ONLINE       0     0     0
        logs    
          mirror-1                 ONLINE       0     0     0
            c0t55CD2E404B7065F4d0  ONLINE       0     0     0
            c0t55CD2E404B707EDCd0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /var/run/agents-installer.sh

Observations:

  • The permanent data corruption seems to hit the oldest files on the pool (perhaps even just the oldest file on the pool)
  • It persists after scrubbing 10+ times and seems to be truly permanent corruption.
  • In this particular case on SmartOS it's very annoying to delete this file, since /var is mounted over later in boot.
#4

Updated by Marcel Telka about 3 years ago

  • Related to Bug #11479: zfs project support added
#5

Updated by Alex Wilson about 3 years ago

  • Subject changed from null/dangling pointer deref somewhere under spa_upgrade_taskq to null/dangling pointer deref somewhere under dmu_objset_upgrade
#6

Updated by Andy Fiddaman almost 3 years ago

A couple of other OmniOS users have recently hit this, and I think it is the same as https://github.com/openzfs/zfs/issues/5295

I'm pulling the fix over from OpenZFS, and I'll also try to reproduce the issue using the details in that ticket.

#7

Updated by Electric Monk almost 3 years ago

  • Gerrit CR set to 1087
#8

Updated by Andy Fiddaman almost 3 years ago

Details from one of my crash dumps:

> ::status
debugging crash dump vmcore.0 (64-bit) from xxxx
operating system: 5.11 omnios-r151036-4a32ffb911 (i86pc)
build version: gfx-drm - heads/master-0-gbdc58b1-dirty
illumos-kvm - heads/r151036-0-g1fba1a9
heads/r151036-0-g4a32ffb911

image uuid: 19459626-713d-c102-f2b6-bd6b4f14d7eb
panic message: BAD TRAP: type=e (#pf Page fault) rp=fffffe00f910fa30 addr=0 occurred in module "<unknown>" due to a NULL pointer dereference
dump content: kernel pages only
>
> $C
fffffe00f910fc00 taskq_thread+0x2cd(fffffeb2002d51d0)
fffffe00f910fc10 thread_start+0xb()
>
> fffffe00f910fa30::whatis
fffffe00f910fa30 is in thread fffffe00f910fc20's stack
> fffffe00f910fc20::findstack
stack pointer for thread fffffe00f910fc20 (tq:z_upgrade): fffffe00f910e6d0
  fffffe00f910fc20 0x15()
>
> ::spa
ADDR                 STATE NAME
fffffeb1d409e000    ACTIVE xxxx
fffffeb1b989d000    ACTIVE rpool
> fffffeb1d409e000::print spa_t spa_upgrade_taskq | ::taskq -t
ADDR             NAME                             ACT/THDS Q'ED  MAXQ INST
fffffeb2002d51d0 z_upgrade                          1/   1  216     -    -
    THREAD           STATE    SOBJ                COUNT
    fffffe00f910fc20 PANIC    <NONE>                  1
                     0x15
> <rsp,5/nap
0xfffffe00f910fb28:
0xfffffe00f910fb28:             dmu_objset_upgrade_task_cb+0x51
0xfffffe00f910fb30:             1
0xfffffe00f910fb38:             0xfffffeb2002d51f0
0xfffffe00f910fb40:             0xfffffe00f910fb50
0xfffffe00f910fb48:             0xfffffeb22782d080
> <rdi::whatis
fffffeb22782d080 is freed from kmem_alloc_768
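
So the callback argument itself points at freed memory: the objset was torn down while its upgrade task was still outstanding on z_upgrade. As a rough userland analogue of the missing discipline (illustrative only, with made-up names such as fake_objset_t and objset_teardown; this is not the actual gate change), teardown has to wait for the deferred work before the object can be freed:

#include <pthread.h>
#include <stdlib.h>

/*
 * Illustrative userland analogue only, not illumos code: an object whose
 * deferred "upgrade" work runs on another thread, plus a teardown path
 * that waits for that work before freeing the object. Skipping the join
 * is the same shape of bug as freeing an objset while its z_upgrade task
 * is still pending.
 */
typedef int (*upgrade_cb_t)(void *);

typedef struct fake_objset {
	upgrade_cb_t	os_upgrade_cb;
	int		os_upgrade_status;
	pthread_t	os_upgrade_tid;
} fake_objset_t;

static int
do_upgrade(void *arg)
{
	(void) arg;
	return (0);			/* pretend the upgrade succeeded */
}

/* Analogue of dmu_objset_upgrade_task_cb: call through the stored pointer. */
static void *
upgrade_task(void *arg)
{
	fake_objset_t *os = arg;

	os->os_upgrade_status = os->os_upgrade_cb(os);
	return (NULL);
}

static void
objset_teardown(fake_objset_t *os)
{
	/*
	 * The crucial step: wait for the deferred work to finish before
	 * freeing. Freeing first is exactly the use-after-free seen in
	 * the dump above.
	 */
	(void) pthread_join(os->os_upgrade_tid, NULL);
	free(os);
}

int
main(void)
{
	fake_objset_t *os = calloc(1, sizeof (*os));

	if (os == NULL)
		return (1);
	os->os_upgrade_cb = do_upgrade;
	(void) pthread_create(&os->os_upgrade_tid, NULL, upgrade_task, os);
	objset_teardown(os);
	return (0);
}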
#9

Updated by Andy Fiddaman almost 3 years ago

With this change in place, I ran the full ZFS test suite under a DEBUG kernel. The results are:

Results Summary
PASS     1116
FAIL      38
SKIP     127

Running Time:   04:47:15
Percent passed: 87.1%
:% rg -F '[FAIL]' log | cut -d/ -f5-
functional/acl/nontrivial/zfs_acl_chmod_inherit_003_pos (run as root) [01:42] [FAIL]
functional/bootfs/bootfs_006_pos (run as root) [00:06] [FAIL]
functional/clean_mirror/setup (run as root) [00:01] [FAIL]
functional/cli_root/zfs_mount/zfs_mount_encrypted (run as root) [00:04] [FAIL]
functional/cli_root/zfs_property/zfs_written_property_001_pos (run as root) [00:12] [FAIL]
functional/cli_root/zpool_add/setup (run as root) [00:02] [FAIL]
functional/cli_root/zpool_create/setup (run as root) [00:02] [FAIL]
functional/cli_root/zpool_destroy/zpool_destroy_001_pos (run as root) [00:02] [FAIL]
functional/cli_root/zpool_expand/zpool_expand_001_pos (run as root) [00:29] [FAIL]
functional/cli_root/zpool_import/setup (run as root) [00:02] [FAIL]
functional/cli_root/zpool_initialize/zpool_initialize_online_offline (run as root) [00:05] [FAIL]
functional/cli_root/zpool_initialize/zpool_initialize_verify_initialized (run as root) [00:09] [FAIL]
functional/cli_root/zpool_remove/setup (run as root) [00:01] [FAIL]
functional/cli_root/zpool_scrub/zpool_scrub_multiple_copies (run as root) [00:03] [FAIL]
functional/grow_pool/setup (run as root) [00:01] [FAIL]
functional/grow_replicas/grow_replicas_001_pos (run as root) [00:01] [FAIL]
functional/inuse/inuse_003_pos (run as root) [00:02] [FAIL]
functional/inuse/inuse_005_pos (run as root) [00:03] [FAIL]
functional/inuse/inuse_006_pos (run as root) [00:03] [FAIL]
functional/inuse/inuse_007_pos (run as root) [00:02] [FAIL]
functional/inuse/inuse_008_pos (run as root) [00:02] [FAIL]
functional/inuse/inuse_009_pos (run as root) [00:02] [FAIL]
functional/migration/setup (run as root) [00:01] [FAIL]
functional/no_space/setup (run as root) [00:01] [FAIL]
functional/persist_l2arc/persist_l2arc_008_pos (run as root) [00:48] [FAIL]
functional/pool_checkpoint/setup (run as root) [01:50] [FAIL]
functional/pool_names/pool_names_001_pos (run as root) [00:01] [FAIL]
functional/poolversion/setup (run as root) [00:01] [FAIL]
functional/poolversion/cleanup (run as root) [00:01] [FAIL]
functional/projectquota/projectquota_003_pos (run as root) [00:01] [FAIL]
functional/refreserv/refreserv_004_pos (run as root) [00:04] [FAIL]
functional/removal/removal_with_add (run as root) [00:34] [FAIL]
functional/removal/remove_raidz (run as root) [00:56] [FAIL]
functional/rsend/rsend_008_pos (run as root) [00:02] [FAIL]
functional/scrub_mirror/setup (run as root) [00:01] [FAIL]
functional/slog/slog_014_pos (run as root) [00:40] [FAIL]
functional/vdev_zaps/vdev_zaps_007_pos (run as root) [00:08] [FAIL]
functional/write_dirs/setup (run as root) [00:01] [FAIL]

Of particular importance, given where this change is:

Test: /opt/zfs-tests/tests/functional/upgrade/setup (run as root) [00:00] [PASS]
Test: /opt/zfs-tests/tests/functional/upgrade/upgrade_userobj_001_pos (run as root) [00:02] [PASS]
Test: /opt/zfs-tests/tests/functional/upgrade/upgrade_projectquota_001_pos (run as root) [00:13] [PASS]
Test: /opt/zfs-tests/tests/functional/upgrade/upgrade_readonly_pool (run as root) [00:01] [PASS]
Test: /opt/zfs-tests/tests/functional/upgrade/cleanup (run as root) [00:00] [PASS]

A number of these test failures are caused by #13324, which was not yet fixed in the gate at the time the tests were run.
The others were failing (at least intermittently) both before and after this change.

#10

Updated by Alex Wilson almost 3 years ago

I cherry-picked CR 1087 onto OmniOS r151034 to test it on one of our machines that was very reliable at hitting the panic originally (it would almost always hit it shortly after boot). So far it hasn't panicked, so I think this may be a success.

#11

Updated by Electric Monk almost 3 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit a39df08cf5d40d52f3f54886012efca9796f23c8

commit  a39df08cf5d40d52f3f54886012efca9796f23c8
Author: Arkadiusz Bubała <arkadiusz.bubala@open-e.com>
Date:   2020-12-04T16:42:10.000Z

    13194 null/dangling pointer deref somewhere under dmu_objset_upgrade
    Portions contributed by: Andy Fiddaman <andy@omniosce.org>
    Reviewed by: Jason King <jason.king@joyent.com>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Reviewed by: Jerry Jelinek <gjelinek@gmail.com>
    Approved by: Robert Mustacchi <rm@fingolfin.org>

#12

Updated by Andy Fiddaman almost 3 years ago

  • Related to Bug #13358: dmu_objset_upgrade_stop() needs to wait added