Bug #13194

null/dangling pointer deref somewhere under dmu_objset_upgrade

Added by Alex Wilson 28 days ago. Updated 28 days ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

After the integration of #11479 zfs project support, we've had machines hit panics shortly after upgrade (and re-hit similar panics after rebooting). Some machines hit it fairly reliably early on in startup, while others hit it much later while recv'ing zfs send streams.

The panic varies, but often looks like:

> ::status
debugging crash dump vmcore.0 (64-bit) from 24-6e-96-31-df-4c
operating system: 5.11 joyent_20190919T022916Z (i86pc)
git branch: master
git rev: 945e7932e079324beb3dba245ce6bc8f401695da
image uuid: (not set)
panic message: BAD TRAP: type=e (#pf Page fault) rp=fffffbe364161a20 addr=0 occurred in module "<unknown>" due to a NULL pointer dereference
dump content: kernel pages only

> $C
fffffbe364161c00 taskq_thread+0x2d0(fffffdf4c59fd198)
fffffbe364161c10 thread_start+0xb()

Sometimes stacks involving dmu_objset_upgrade_task_cb appear as well. The crash dump following the panic usually completes successfully.

We haven't been able to reproduce this on any non-production systems, which makes debugging more challenging. We have also observed permanent data corruption in ZFS pools following this panic on two machines, so we are very loath to reproduce it further in production. I'll add my initial analysis of one of the dumps as a comment.

When we clobber the upgrade process that was introduced for #11479 by adding return; as the first line of dmu_objset_upgrade, these panics do not occur and our systems run as normal. We have patched all of our production illumos systems in this way since late last year to avoid this panic.


Related issues

Related to illumos gate - Bug #11479: zfs project support (Closed, assignee: Jerry Jelinek)

#1

Updated by Alex Wilson 28 days ago

[root@24-6e-96-31-df-4c (gps-1) /var/crash/volatile]# mdb {unix,vmcore}.0
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs mm sd lofs idm mpt_sas sata crypto random cpc logindmux ptm sppp nsmb smbsrv nfs ]

> ::status
debugging crash dump vmcore.0 (64-bit) from 24-6e-96-31-df-4c
operating system: 5.11 joyent_20190919T022916Z (i86pc)
git branch: master
git rev: 945e7932e079324beb3dba245ce6bc8f401695da
image uuid: (not set)
panic message: BAD TRAP: type=e (#pf Page fault) rp=fffffbe364161a20 addr=0 occurred in module "<unknown>" due to a NULL pointer dereference
dump content: kernel pages only

> $C
fffffbe364161c00 taskq_thread+0x2d0(fffffdf4c59fd198)
fffffbe364161c10 thread_start+0xb()

(we jumped to NULL, so the innermost stack frame is garbage and doesn't show up here)

> ::regs
%rax = 0x0000000000000000                 %r9  = 0xfffffbe36433cc20
%rbx = 0xfffffdf4c28e49c0                 %r10 = 0x0000000000000000
%rcx = 0x0000000000000003                 %r11 = 0x0000000000000000
%rdx = 0xfffffbe364161c20                 %r12 = 0xfffffdf4c28e4c68
%rsi = 0xfffffdf4befadf18                 %r13 = 0x0000000000000001
%rdi = 0xfffffdf4c28e49c0                 %r14 = 0xfffffdf4c59fd1c0
%r8  = 0xfffffdf45d918000                 %r15 = 0xfffffbe364161ba0

%rip = 0x0000000000000000
%rbp = 0xfffffbe364161b40
%rsp = 0xfffffbe364161b18
%rflags = 0x00010246
  id=0 vip=0 vif=0 ac=0 vm=0 rf=1 nt=0 iopl=0x0
  status=<of,df,IF,tf,sf,ZF,af,PF,cf>

%cs = 0x0030    %ds = 0x004b    %es = 0x004b    %fs = 0x0000
%gs = 0x01c3    %gsbase = 0xfffffdf45ceb6000    %kgsbase = 0x0
%trapno = 0xe   %err = 0x10     %cr2 = 0x0      %cr3 = 0x1f800000

dump the stack out a bit:

> 0xfffffbe364161b18,10/nap
0xfffffbe364161b18:
0xfffffbe364161b18:             dmu_objset_upgrade_task_cb+0x4d     # so we probably got into this function
0xfffffbe364161b20:             0xfffffdf4c59fd198
0xfffffbe364161b28:             0xfffffdf4c59fd1b8
0xfffffbe364161b30:             0xfffffbe364161b40
0xfffffbe364161b38:             0xfffffdf4c28e49c0
0xfffffbe364161b40:             0xfffffbe364161c00
0xfffffbe364161b48:             taskq_thread+0x2d0
0xfffffbe364161b50:             0xbaddcafebaddcafe
0xfffffbe364161b58:             0xfffffdf4befadf18                  # this is probably the taskq_ent_t
0xfffffbe364161b60:             0x1408b1cace2585
0xfffffbe364161b68:             0xfffffdf4c59fd1ca
0xfffffbe364161b70:             0xfffffdf4c59fd1c8
0xfffffbe364161b78:             0xfffffdf4c59fd1cc
0xfffffbe364161b80:             1
0xfffffbe364161b88:             0xfffffdf4c59fd208
0xfffffbe364161b90:             0x15d419000

> 0xfffffdf4befadf18::print -t taskq_ent_t
taskq_ent_t {
    struct taskq_ent *tqent_next = 0xfffffdf4bea8e598
    struct taskq_ent *tqent_prev = 0xfffffdf4c59fd208
    task_func_t *tqent_func = dmu_objset_upgrade_task_cb
    void *tqent_arg = 0xfffffdf4c28e49c0
    union  tqent_un = {
        taskq_bucket_t *tqent_bucket = 0
        uintptr_t tqent_flags = 0
    }
    kthread_t *tqent_thread = 0
    kcondvar_t tqent_cv = {
        ushort_t _opaque = 0
    }
}

> dmu_objset_upgrade_task_cb+0x4d::dis
dmu_objset_upgrade_task_cb+0x1e:movq   %r12,%rdi
dmu_objset_upgrade_task_cb+0x21:call   +0x3a43a0a       <mutex_enter>
dmu_objset_upgrade_task_cb+0x26:movl   0x2c0(%rbx),%r10d
dmu_objset_upgrade_task_cb+0x2d:movl   $0x4,0x2c4(%rbx)
dmu_objset_upgrade_task_cb+0x37:testl  %r10d,%r10d
dmu_objset_upgrade_task_cb+0x3a:jne    +0x1f    <dmu_objset_upgrade_task_cb+0x5b>
dmu_objset_upgrade_task_cb+0x3c:movq   %r12,%rdi
dmu_objset_upgrade_task_cb+0x3f:call   +0x3a43b0c       <mutex_exit>
dmu_objset_upgrade_task_cb+0x44:movq   %rbx,%rdi
dmu_objset_upgrade_task_cb+0x47:call   *0x2b8(%rbx)                        # looks like this is the deref that sent us to pc=0
dmu_objset_upgrade_task_cb+0x4d:movq   %r12,%rdi
dmu_objset_upgrade_task_cb+0x50:movl   %eax,0x2c4(%rbx)
dmu_objset_upgrade_task_cb+0x56:call   +0x3a439d5       <mutex_enter>
dmu_objset_upgrade_task_cb+0x5b:movl   $0x1,0x2c0(%rbx)
dmu_objset_upgrade_task_cb+0x65:movq   $0x0,0x2b0(%rbx)
dmu_objset_upgrade_task_cb+0x70:movq   %r12,%rdi
dmu_objset_upgrade_task_cb+0x73:call   +0x3a43ad8       <mutex_exit>
dmu_objset_upgrade_task_cb+0x78:movq   -0x20(%rbp),%rbx
dmu_objset_upgrade_task_cb+0x7c:movq   -0x18(%rbp),%r12
dmu_objset_upgrade_task_cb+0x80:leave
dmu_objset_upgrade_task_cb+0x81:ret

> ::offsetof objset_t os_upgrade_cb
offsetof (objset_t, os_upgrade_cb) = 0x2b8, sizeof (...->os_upgrade_cb) = 8

> 0xfffffdf4c28e49c0::print objset_t os_upgrade_cb
os_upgrade_cb = 0

the rest of this objset_t looks bonkers too:

objset_t {
    struct dsl_dataset *os_dsl_dataset = 0xd000d000d000d
    spa_t *os_spa = 0xd000d000d000d
    arc_buf_t *os_phys_buf = 0xd000d000d000d
    objset_phys_t *os_phys = 0xd000d000d000d
    boolean_t os_encrypted = 0xd000d (B_TRUE|0xd000c)
    dnode_handle_t os_meta_dnode = {
        zrlock_t dnh_zrlock = {
            kmutex_t zr_mtx = {
                void *[1] _opaque = [ 0xd000d000d000d ]
            }
            volatile int32_t zr_refcount = 0xd000d
            kcondvar_t zr_cv = {
                ushort_t _opaque = 0xd
            }
            uint16_t zr_pad = 0xd
        }
        dnode_t *dnh_dnode = 0xd000d000d000d
    }
    dnode_handle_t os_userused_dnode = {
        zrlock_t dnh_zrlock = {
            kmutex_t zr_mtx = {
                void *[1] _opaque = [ 0xd000d000d000d ]
            }
            volatile int32_t zr_refcount = 0xd000d
            kcondvar_t zr_cv = {
                ushort_t _opaque = 0xd
            }
            uint16_t zr_pad = 0xd
        }
        dnode_t *dnh_dnode = 0xd000d000d000d
    }
    dnode_handle_t os_groupused_dnode = {
        zrlock_t dnh_zrlock = {
            kmutex_t zr_mtx = {
                void *[1] _opaque = [ 0xd000d000d000d ]
            }
            volatile int32_t zr_refcount = 0xd000d
            kcondvar_t zr_cv = {
                ushort_t _opaque = 0xd
            }
            uint16_t zr_pad = 0xd
        }
        dnode_t *dnh_dnode = 0xd000d000d000d
    }
    dnode_handle_t os_projectused_dnode = {
        zrlock_t dnh_zrlock = {
            kmutex_t zr_mtx = {
                void *[1] _opaque = [ 0xd000d000d000d ]
            }
            volatile int32_t zr_refcount = 0xd000d
            kcondvar_t zr_cv = {
                ushort_t _opaque = 0xd
            }
            uint16_t zr_pad = 0xd
        }
        dnode_t *dnh_dnode = 0xd000d000d000d
    }
    zilog_t *os_zil = 0xd000d000d000d
    list_node_t os_evicting_node = {
        struct list_node *list_next = 0xd000d000d000d
...
        uint64_t zh_flags = 0xd000d000d000d
        uint64_t zh_claim_lr_seq = 0xd000d000d000d
        uint64_t [3] zh_pad = [ 0xd000d000d000d, 0xd000d000d000d, 0xd000d000d000d ]
    }
    multilist_t *os_synced_dnodes = 0xd000d000d000d
    uint64_t os_flags = 0xd000d000d000d
    uint64_t os_freed_dnodes = 0xd000d000d000d
    boolean_t os_rescan_dnodes = 0xd000d (B_TRUE|0xd000c)
    boolean_t os_raw_receive = 0xd000d (B_TRUE|0xd000c)
    boolean_t [4] os_next_write_raw = [ 0xd000d (B_TRUE|0xd000c), 0xd000d (B_TRUE|0xd000c), 0xd000d (B_TRUE|0xd000c), 0xd000d (B_TRUE|0xd000c) ]
    kmutex_t os_obj_lock = {
        void *[1] _opaque = [ 0xd000d000d000d ]
    }
    uint64_t os_obj_next_chunk = 0xd000d000d000d
    uint64_t *os_obj_next_percpu = 0xd000d000d000d
    int os_obj_next_percpu_len = 0xd000d
    kmutex_t os_lock = {
        void *[1] _opaque = [ 0xd000d000d000d ]
    }
    multilist_t *[4] os_dirty_dnodes = [ 0xd000d000d000d, 0xd000d000d000d, 0xd000d000d000d, 0xd000d000d000d ]
    list_t os_dnodes = {
        size_t list_size = 0xd000d000d000d
        size_t list_offset = 0xd000d000d000d
        struct list_node list_head = {
            struct list_node *list_next = 0xd000d000d000d
            struct list_node *list_prev = 0xd000d000d000d
        }
    }
    list_t os_downgraded_dbufs = {
        size_t list_size = 0xd000d000d000d
        size_t list_offset = 0xd000d000d000d
        struct list_node list_head = {
            struct list_node *list_next = 0xd000d000d000d
            struct list_node *list_prev = 0xd000d000d000d
        }
    }
    kmutex_t os_userused_lock = {
        void *[1] _opaque = [ 0xd ]
    }
    kmutex_t os_user_ptr_lock = {
        void *[1] _opaque = [ 0 ]
    }
    void *os_user_ptr = 0
    sa_os_t *os_sa = 0x50
    kmutex_t os_upgrade_lock = {
        void *[1] _opaque = [ 0 ]
    }
    taskqid_t os_upgrade_id = 0
    dmu_objset_upgrade_cb_t os_upgrade_cb = 0
    boolean_t os_upgrade_exit = 0 (0)
    int os_upgrade_status = 0x4
}
#2

Updated by Alex Wilson 28 days ago

Some further observations:

  • This seems to occur most readily on machines which are doing lots of metadata operations (like creating and destroying snapshots or datasets)
  • The path which sends us here (dmu_objset_upgrade) is called during mount, so datasets that are set not to auto-mount don't seem to trigger it
  • We've had this occur on both SmartOS and OmniOS(ce)
  • We've observed this periodically ever since #11479 was integrated. This is not a recent new issue. We patched all of our production systems as of early this year to neuter the dataset upgrade code and have only observed it since then when machines were accidentally booted to an un-patched kernel.
  • Timing, stack contents, and specifics are quite variable, so I suspect the underlying bug might be a race or have a strong timing component
  • We have 3 different dumps of this issue available, but they all contain potentially sensitive data from production systems, which makes it difficult to share them.
#3

Updated by Alex Wilson 28 days ago

An example of one of the machines with permanent data corruption after hitting this:

[root@90-e2-ba-d3-ec-a0 (gps-1) ~]# zpool status -v
  pool: zones
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0 days 00:32:47 with 1 errors on Tue Jul 21 08:36:53 2020
config:

        NAME                       STATE     READ WRITE CKSUM
        zones                      ONLINE       0     0     0
          raidz1-0                 ONLINE       0     0     0
            c1t5000C500862213D5d0  ONLINE       0     0     0
            c2t5000C500862206BDd0  ONLINE       0     0     0
            c3t5000C50085402429d0  ONLINE       0     0     0
            c4t5000C50085CEBDE5d0  ONLINE       0     0     0
        logs    
          mirror-1                 ONLINE       0     0     0
            c0t55CD2E404B7065F4d0  ONLINE       0     0     0
            c0t55CD2E404B707EDCd0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /var/run/agents-installer.sh

Observations:

  • The permanent data corruption seems to hit the oldest files on the pool (perhaps even just the oldest file on the pool)
  • It persists after scrubbing 10+ times and seems to be truly permanent corruption.
  • In this particular case on SmartOS it's very annoying to delete this file, since /var is mounted over it later in boot.
#4

Updated by Marcel Telka 28 days ago

  • Related to Bug #11479: zfs project support added
#5

Updated by Alex Wilson 28 days ago

  • Subject changed from null/dangling pointer deref somewhere under spa_upgrade_taskq to null/dangling pointer deref somewhere under dmu_objset_upgrade
