Bug #13194
null/dangling pointer deref somewhere under dmu_objset_upgrade
Status: Closed
% Done: 100%
Description
After the integration of #11479 zfs project support, we've had machines hit panics shortly after upgrade (and re-hit similar panics after rebooting). Some machines hit it fairly reliably early on in startup, while others hit it much later while recv'ing zfs send streams.
The panic varies, but often looks like:
> ::status
debugging crash dump vmcore.0 (64-bit) from 24-6e-96-31-df-4c
operating system: 5.11 joyent_20190919T022916Z (i86pc)
git branch: master
git rev: 945e7932e079324beb3dba245ce6bc8f401695da
image uuid: (not set)
panic message:
BAD TRAP: type=e (#pf Page fault) rp=fffffbe364161a20 addr=0 occurred in module "<unknown>" due to a NULL pointer dereference
dump content: kernel pages only
> $C
fffffbe364161c00 taskq_thread+0x2d0(fffffdf4c59fd198)
fffffbe364161c10 thread_start+0xb()
Sometimes stacks appear which involve dmu_objset_upgrade_task_cb as well. Dumping following the panic usually succeeds.
We haven't been able to reproduce this on any non-production systems, which makes debugging more challenging. We have also observed permanent data corruption in ZFS pools following this panic on two machines, so we are very loath to reproduce it further in production. I'll add my initial analysis of one of the dumps as a comment.
When we clobber the upgrade process that was introduced for #11479, by adding return; as the first line of dmu_objset_upgrade(), these panics do not occur and our systems run as normal. We have patched all of our production illumos systems in this way since late last year to avoid this panic.
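For clarity, the workaround amounts to something like the following sketch (abbreviated, not the actual production diff; the function body shown as elided is the original taskq dispatch logic):

```c
/*
 * Sketch of the local workaround described above (not the upstream fix):
 * short-circuit dmu_objset_upgrade() so the asynchronous upgrade callback
 * is never dispatched onto the z_upgrade taskq.
 */
static void
dmu_objset_upgrade(objset_t *os, dmu_objset_upgrade_cb_t cb)
{
	return;		/* workaround: never queue dmu_objset_upgrade_task_cb */
	/* ... original body: record cb and dispatch the upgrade task ... */
}
```

This avoids the panic entirely, at the cost of never performing the userobj/project accounting upgrade on affected datasets.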
Updated by Alex Wilson about 3 years ago
[root@24-6e-96-31-df-4c (gps-1) /var/crash/volatile]# mdb {unix,vmcore}.0
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs mm sd lofs idm mpt_sas sata crypto random cpc logindmux ptm sppp nsmb smbsrv nfs ]
> ::status
debugging crash dump vmcore.0 (64-bit) from 24-6e-96-31-df-4c
operating system: 5.11 joyent_20190919T022916Z (i86pc)
git branch: master
git rev: 945e7932e079324beb3dba245ce6bc8f401695da
image uuid: (not set)
panic message:
BAD TRAP: type=e (#pf Page fault) rp=fffffbe364161a20 addr=0 occurred in module "<unknown>" due to a NULL pointer dereference
dump content: kernel pages only
> $C
fffffbe364161c00 taskq_thread+0x2d0(fffffdf4c59fd198)
fffffbe364161c10 thread_start+0xb()

(we jumped to null, so the last stack frame is gonna be bonkers and not come up here)

> ::regs
%rax = 0x0000000000000000       %r9  = 0xfffffbe36433cc20
%rbx = 0xfffffdf4c28e49c0       %r10 = 0x0000000000000000
%rcx = 0x0000000000000003       %r11 = 0x0000000000000000
%rdx = 0xfffffbe364161c20       %r12 = 0xfffffdf4c28e4c68
%rsi = 0xfffffdf4befadf18       %r13 = 0x0000000000000001
%rdi = 0xfffffdf4c28e49c0       %r14 = 0xfffffdf4c59fd1c0
%r8  = 0xfffffdf45d918000       %r15 = 0xfffffbe364161ba0

%rip = 0x0000000000000000
%rbp = 0xfffffbe364161b40
%rsp = 0xfffffbe364161b18
%rflags = 0x00010246
  id=0 vip=0 vif=0 ac=0 vm=0 rf=1 nt=0 iopl=0x0
  status=<of,df,IF,tf,sf,ZF,af,PF,cf>

%cs = 0x0030    %ds = 0x004b    %es = 0x004b
%fs = 0x0000    %gs = 0x01c3

%gsbase  = 0xfffffdf45ceb6000
%kgsbase = 0x0
%trapno  = 0xe
%err     = 0x10
%cr2     = 0x0
%cr3     = 0x1f800000

dump the stack out a bit:

> 0xfffffbe364161b18,10/nap
0xfffffbe364161b18:
0xfffffbe364161b18: dmu_objset_upgrade_task_cb+0x4d    # so we probably got into this function
0xfffffbe364161b20: 0xfffffdf4c59fd198
0xfffffbe364161b28: 0xfffffdf4c59fd1b8
0xfffffbe364161b30: 0xfffffbe364161b40
0xfffffbe364161b38: 0xfffffdf4c28e49c0
0xfffffbe364161b40: 0xfffffbe364161c00
0xfffffbe364161b48: taskq_thread+0x2d0
0xfffffbe364161b50: 0xbaddcafebaddcafe
0xfffffbe364161b58: 0xfffffdf4befadf18                 # this is probably the taskq_ent_t
0xfffffbe364161b60: 0x1408b1cace2585
0xfffffbe364161b68: 0xfffffdf4c59fd1ca
0xfffffbe364161b70: 0xfffffdf4c59fd1c8
0xfffffbe364161b78: 0xfffffdf4c59fd1cc
0xfffffbe364161b80: 1
0xfffffbe364161b88: 0xfffffdf4c59fd208
0xfffffbe364161b90: 0x15d419000

> 0xfffffdf4befadf18::print -t taskq_ent_t
taskq_ent_t {
    struct taskq_ent *tqent_next = 0xfffffdf4bea8e598
    struct taskq_ent *tqent_prev = 0xfffffdf4c59fd208
    task_func_t *tqent_func = dmu_objset_upgrade_task_cb
    void *tqent_arg = 0xfffffdf4c28e49c0
    union tqent_un = {
        taskq_bucket_t *tqent_bucket = 0
        uintptr_t tqent_flags = 0
    }
    kthread_t *tqent_thread = 0
    kcondvar_t tqent_cv = {
        ushort_t _opaque = 0
    }
}

> dmu_objset_upgrade_task_cb+0x4d::dis
dmu_objset_upgrade_task_cb+0x1e: movq   %r12,%rdi
dmu_objset_upgrade_task_cb+0x21: call   +0x3a43a0a      <mutex_enter>
dmu_objset_upgrade_task_cb+0x26: movl   0x2c0(%rbx),%r10d
dmu_objset_upgrade_task_cb+0x2d: movl   $0x4,0x2c4(%rbx)
dmu_objset_upgrade_task_cb+0x37: testl  %r10d,%r10d
dmu_objset_upgrade_task_cb+0x3a: jne    +0x1f   <dmu_objset_upgrade_task_cb+0x5b>
dmu_objset_upgrade_task_cb+0x3c: movq   %r12,%rdi
dmu_objset_upgrade_task_cb+0x3f: call   +0x3a43b0c      <mutex_exit>
dmu_objset_upgrade_task_cb+0x44: movq   %rbx,%rdi
dmu_objset_upgrade_task_cb+0x47: call   *0x2b8(%rbx)    # looks like this is the deref that sent us to pc=0
dmu_objset_upgrade_task_cb+0x4d: movq   %r12,%rdi
dmu_objset_upgrade_task_cb+0x50: movl   %eax,0x2c4(%rbx)
dmu_objset_upgrade_task_cb+0x56: call   +0x3a439d5      <mutex_enter>
dmu_objset_upgrade_task_cb+0x5b: movl   $0x1,0x2c0(%rbx)
dmu_objset_upgrade_task_cb+0x65: movq   $0x0,0x2b0(%rbx)
dmu_objset_upgrade_task_cb+0x70: movq   %r12,%rdi
dmu_objset_upgrade_task_cb+0x73: call   +0x3a43ad8      <mutex_exit>
dmu_objset_upgrade_task_cb+0x78: movq   -0x20(%rbp),%rbx
dmu_objset_upgrade_task_cb+0x7c: movq   -0x18(%rbp),%r12
dmu_objset_upgrade_task_cb+0x80: leave
dmu_objset_upgrade_task_cb+0x81: ret

> ::offsetof objset_t os_upgrade_cb
offsetof (objset_t, os_upgrade_cb) = 0x2b8, sizeof (...->os_upgrade_cb) = 8
> 0xfffffdf4c28e49c0::print objset_t os_upgrade_cb
os_upgrade_cb = 0

the rest of this objset_t looks bonkers too:

objset_t {
    struct dsl_dataset *os_dsl_dataset = 0xd000d000d000d
    spa_t *os_spa = 0xd000d000d000d
    arc_buf_t *os_phys_buf = 0xd000d000d000d
    objset_phys_t *os_phys = 0xd000d000d000d
    boolean_t os_encrypted = 0xd000d (B_TRUE|0xd000c)
    dnode_handle_t os_meta_dnode = {
        zrlock_t dnh_zrlock = {
            kmutex_t zr_mtx = {
                void *[1] _opaque = [ 0xd000d000d000d ]
            }
            volatile int32_t zr_refcount = 0xd000d
            kcondvar_t zr_cv = {
                ushort_t _opaque = 0xd
            }
            uint16_t zr_pad = 0xd
        }
        dnode_t *dnh_dnode = 0xd000d000d000d
    }
    dnode_handle_t os_userused_dnode = {
        zrlock_t dnh_zrlock = {
            kmutex_t zr_mtx = {
                void *[1] _opaque = [ 0xd000d000d000d ]
            }
            volatile int32_t zr_refcount = 0xd000d
            kcondvar_t zr_cv = {
                ushort_t _opaque = 0xd
            }
            uint16_t zr_pad = 0xd
        }
        dnode_t *dnh_dnode = 0xd000d000d000d
    }
    dnode_handle_t os_groupused_dnode = {
        zrlock_t dnh_zrlock = {
            kmutex_t zr_mtx = {
                void *[1] _opaque = [ 0xd000d000d000d ]
            }
            volatile int32_t zr_refcount = 0xd000d
            kcondvar_t zr_cv = {
                ushort_t _opaque = 0xd
            }
            uint16_t zr_pad = 0xd
        }
        dnode_t *dnh_dnode = 0xd000d000d000d
    }
    dnode_handle_t os_projectused_dnode = {
        zrlock_t dnh_zrlock = {
            kmutex_t zr_mtx = {
                void *[1] _opaque = [ 0xd000d000d000d ]
            }
            volatile int32_t zr_refcount = 0xd000d
            kcondvar_t zr_cv = {
                ushort_t _opaque = 0xd
            }
            uint16_t zr_pad = 0xd
        }
        dnode_t *dnh_dnode = 0xd000d000d000d
    }
    zilog_t *os_zil = 0xd000d000d000d
    list_node_t os_evicting_node = {
        struct list_node *list_next = 0xd000d000d000d
    ...
        uint64_t zh_flags = 0xd000d000d000d
        uint64_t zh_claim_lr_seq = 0xd000d000d000d
        uint64_t [3] zh_pad = [ 0xd000d000d000d, 0xd000d000d000d, 0xd000d000d000d ]
    }
    multilist_t *os_synced_dnodes = 0xd000d000d000d
    uint64_t os_flags = 0xd000d000d000d
    uint64_t os_freed_dnodes = 0xd000d000d000d
    boolean_t os_rescan_dnodes = 0xd000d (B_TRUE|0xd000c)
    boolean_t os_raw_receive = 0xd000d (B_TRUE|0xd000c)
    boolean_t [4] os_next_write_raw = [ 0xd000d (B_TRUE|0xd000c), 0xd000d (B_TRUE|0xd000c), 0xd000d (B_TRUE|0xd000c), 0xd000d (B_TRUE|0xd000c) ]
    kmutex_t os_obj_lock = {
        void *[1] _opaque = [ 0xd000d000d000d ]
    }
    uint64_t os_obj_next_chunk = 0xd000d000d000d
    uint64_t *os_obj_next_percpu = 0xd000d000d000d
    int os_obj_next_percpu_len = 0xd000d
    kmutex_t os_lock = {
        void *[1] _opaque = [ 0xd000d000d000d ]
    }
    multilist_t *[4] os_dirty_dnodes = [ 0xd000d000d000d, 0xd000d000d000d, 0xd000d000d000d, 0xd000d000d000d ]
    list_t os_dnodes = {
        size_t list_size = 0xd000d000d000d
        size_t list_offset = 0xd000d000d000d
        struct list_node list_head = {
            struct list_node *list_next = 0xd000d000d000d
            struct list_node *list_prev = 0xd000d000d000d
        }
    }
    list_t os_downgraded_dbufs = {
        size_t list_size = 0xd000d000d000d
        size_t list_offset = 0xd000d000d000d
        struct list_node list_head = {
            struct list_node *list_next = 0xd000d000d000d
            struct list_node *list_prev = 0xd000d000d000d
        }
    }
    kmutex_t os_userused_lock = {
        void *[1] _opaque = [ 0xd ]
    }
    kmutex_t os_user_ptr_lock = {
        void *[1] _opaque = [ 0 ]
    }
    void *os_user_ptr = 0
    sa_os_t *os_sa = 0x50
    kmutex_t os_upgrade_lock = {
        void *[1] _opaque = [ 0 ]
    }
    taskqid_t os_upgrade_id = 0
    dmu_objset_upgrade_cb_t os_upgrade_cb = 0
    boolean_t os_upgrade_exit = 0 (0)
    int os_upgrade_status = 0x4
}
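Mapping the disassembly above back to source, dmu_objset_upgrade_task_cb() looks roughly like the following reconstructed sketch (field/offset correspondence inferred from the ::offsetof output: %rbx is the objset_t, 0x2b8 = os_upgrade_cb, 0x2c0 appears to be os_upgrade_exit, 0x2c4 os_upgrade_status, 0x2b0 os_upgrade_id; EINTR is 4, matching the movl $0x4):

```c
/*
 * Reconstructed sketch of dmu_objset_upgrade_task_cb(), annotated against
 * the disassembly above (%rbx = objset_t *os, %r12 = &os->os_upgrade_lock).
 */
static void
dmu_objset_upgrade_task_cb(void *data)
{
	objset_t *os = data;

	mutex_enter(&os->os_upgrade_lock);		/* +0x21 */
	os->os_upgrade_status = EINTR;			/* +0x2d: $0x4 -> 0x2c4 */
	if (!os->os_upgrade_exit) {			/* +0x37: test 0x2c0 */
		mutex_exit(&os->os_upgrade_lock);	/* +0x3f */
		/*
		 * +0x47: call *0x2b8(%rbx) -- the indirect call through
		 * os_upgrade_cb.  In this dump the objset_t has already
		 * been freed and poisoned, os_upgrade_cb reads as 0, and
		 * the call lands on pc=0: the panic.
		 */
		os->os_upgrade_status = os->os_upgrade_cb(os);
		mutex_enter(&os->os_upgrade_lock);	/* +0x56 */
	}
	os->os_upgrade_exit = B_TRUE;			/* +0x5b */
	os->os_upgrade_id = 0;				/* +0x65: clear 0x2b0 */
	mutex_exit(&os->os_upgrade_lock);		/* +0x73 */
}
```

In other words, the taskq entry held a pointer to an objset_t that was torn down before the task ran, so the callback dereferenced stale memory.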
Updated by Alex Wilson about 3 years ago
Some further observations:
- This seems to occur most readily on machines which are doing lots of metadata operations (like creating and destroying snapshots or datasets)
- The path which sends us here (dmu_objset_upgrade) is called during mount, so datasets that are set not to auto-mount don't seem to trigger it
- We've had this occur on both SmartOS and OmniOS(ce)
- We've observed this periodically ever since #11479 was integrated. This is not a recent new issue. We patched all of our production systems as of early this year to neuter the dataset upgrade code and have only observed it since then when machines were accidentally booted to an un-patched kernel.
- Timing, stack contents, and specifics are quite variable, so I suspect the underlying bug might be a race or have a strong timing component
- We have 3 different dumps of this issue available, but they all contain potentially sensitive data from production systems, which makes it difficult to share them.
Updated by Alex Wilson about 3 years ago
An example of one of the machines with permanent data corruption after hitting this:
[root@90-e2-ba-d3-ec-a0 (gps-1) ~]# zpool status -v
  pool: zones
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0 days 00:32:47 with 1 errors on Tue Jul 21 08:36:53 2020
config:

        NAME                       STATE     READ WRITE CKSUM
        zones                      ONLINE       0     0     0
          raidz1-0                 ONLINE       0     0     0
            c1t5000C500862213D5d0  ONLINE       0     0     0
            c2t5000C500862206BDd0  ONLINE       0     0     0
            c3t5000C50085402429d0  ONLINE       0     0     0
            c4t5000C50085CEBDE5d0  ONLINE       0     0     0
        logs
          mirror-1                 ONLINE       0     0     0
            c0t55CD2E404B7065F4d0  ONLINE       0     0     0
            c0t55CD2E404B707EDCd0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /var/run/agents-installer.sh
Observations:
- The permanent data corruption seems to hit the oldest files on the pool (perhaps even just the oldest file on the pool)
- It persists after scrubbing 10+ times and seems to be truly permanent corruption.
- In this particular case on SmartOS it's very annoying to delete this file, since /var is mounted over later.
Updated by Marcel Telka about 3 years ago
- Related to Bug #11479: zfs project support added
Updated by Alex Wilson about 3 years ago
- Subject changed from null/dangling pointer deref somewhere under spa_upgrade_taskq to null/dangling pointer deref somewhere under dmu_objset_upgrade
Updated by Andy Fiddaman almost 3 years ago
A couple of other OmniOS users have recently hit this, and I think it is the same as https://github.com/openzfs/zfs/issues/5295
I'm pulling the fix over from OpenZFS and I'll also try to replicate the problem using the details in that ticket.
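For context, the general shape of the OpenZFS fix is to stop the asynchronous upgrade task before the objset is destroyed: mark the upgrade as exiting under os_upgrade_lock, then cancel the queued taskq entry, or wait for it if it is already running. A rough sketch, not the literal commit (taskq_cancel_id/taskq_wait_id are the OpenZFS SPL names; the illumos port may differ in detail, and follow-up bug #13358 notes that the stop path needs to wait):

```c
/*
 * Rough sketch of the fix's approach (not the literal commit): ensure no
 * upgrade task is still queued or running when the objset is torn down.
 */
static void
dmu_objset_upgrade_stop(objset_t *os)
{
	mutex_enter(&os->os_upgrade_lock);
	os->os_upgrade_exit = B_TRUE;		/* tell the callback to bail out */
	if (os->os_upgrade_id != 0) {
		taskqid_t id = os->os_upgrade_id;

		os->os_upgrade_id = 0;
		mutex_exit(&os->os_upgrade_lock);

		/* Remove the entry if it hasn't started; else wait for it. */
		if (taskq_cancel_id(os->os_spa->spa_upgrade_taskq, id) != 0)
			taskq_wait_id(os->os_spa->spa_upgrade_taskq, id);
	} else {
		mutex_exit(&os->os_upgrade_lock);
	}
}
```

With this in place the taskq can no longer hold a dangling objset_t pointer across eviction.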
Updated by Andy Fiddaman almost 3 years ago
Details from one of my crash dumps:
> ::status
debugging crash dump vmcore.0 (64-bit) from xxxx
operating system: 5.11 omnios-r151036-4a32ffb911 (i86pc)
build version: gfx-drm - heads/master-0-gbdc58b1-dirty
               illumos-kvm - heads/r151036-0-g1fba1a9
               heads/r151036-0-g4a32ffb911
image uuid: 19459626-713d-c102-f2b6-bd6b4f14d7eb
panic message:
BAD TRAP: type=e (#pf Page fault) rp=fffffe00f910fa30 addr=0 occurred in module "<unknown>" due to a NULL pointer dereference
dump content: kernel pages only

> $C
fffffe00f910fc00 taskq_thread+0x2cd(fffffeb2002d51d0)
fffffe00f910fc10 thread_start+0xb()

> fffffe00f910fa30::whatis
fffffe00f910fa30 is in thread fffffe00f910fc20's stack
> fffffe00f910fc20::findstack
stack pointer for thread fffffe00f910fc20 (tq:z_upgrade): fffffe00f910e6d0
  fffffe00f910fc20 0x15()

> ::spa
ADDR                 STATE NAME
fffffeb1d409e000    ACTIVE xxxx
fffffeb1b989d000    ACTIVE rpool
> fffffeb1d409e000::print spa_t spa_upgrade_taskq | ::taskq -t
ADDR             NAME                      ACT/THDS Q'ED  MAXQ INST
fffffeb2002d51d0 z_upgrade                   1/   1  216     -    -
        THREAD           STATE    SOBJ        COUNT
        fffffe00f910fc20 PANIC    <NONE>      1 0x15

> <rsp,5/nap
0xfffffe00f910fb28:
0xfffffe00f910fb28: dmu_objset_upgrade_task_cb+0x51
0xfffffe00f910fb30: 1
0xfffffe00f910fb38: 0xfffffeb2002d51f0
0xfffffe00f910fb40: 0xfffffe00f910fb50
0xfffffe00f910fb48: 0xfffffeb22782d080

> <rdi::whatis
fffffeb22782d080 is freed from kmem_alloc_768
Updated by Andy Fiddaman almost 3 years ago
With this change in place, I ran a full ZFS testsuite under a DEBUG kernel. The results are:
Results Summary
PASS    1116
FAIL      38
SKIP     127

Running Time:    04:47:15
Percent passed:  87.1%
% rg -F '[FAIL]' log | cut -d/ -f5-
functional/acl/nontrivial/zfs_acl_chmod_inherit_003_pos (run as root) [01:42] [FAIL]
functional/bootfs/bootfs_006_pos (run as root) [00:06] [FAIL]
functional/clean_mirror/setup (run as root) [00:01] [FAIL]
functional/cli_root/zfs_mount/zfs_mount_encrypted (run as root) [00:04] [FAIL]
functional/cli_root/zfs_property/zfs_written_property_001_pos (run as root) [00:12] [FAIL]
functional/cli_root/zpool_add/setup (run as root) [00:02] [FAIL]
functional/cli_root/zpool_create/setup (run as root) [00:02] [FAIL]
functional/cli_root/zpool_destroy/zpool_destroy_001_pos (run as root) [00:02] [FAIL]
functional/cli_root/zpool_expand/zpool_expand_001_pos (run as root) [00:29] [FAIL]
functional/cli_root/zpool_import/setup (run as root) [00:02] [FAIL]
functional/cli_root/zpool_initialize/zpool_initialize_online_offline (run as root) [00:05] [FAIL]
functional/cli_root/zpool_initialize/zpool_initialize_verify_initialized (run as root) [00:09] [FAIL]
functional/cli_root/zpool_remove/setup (run as root) [00:01] [FAIL]
functional/cli_root/zpool_scrub/zpool_scrub_multiple_copies (run as root) [00:03] [FAIL]
functional/grow_pool/setup (run as root) [00:01] [FAIL]
functional/grow_replicas/grow_replicas_001_pos (run as root) [00:01] [FAIL]
functional/inuse/inuse_003_pos (run as root) [00:02] [FAIL]
functional/inuse/inuse_005_pos (run as root) [00:03] [FAIL]
functional/inuse/inuse_006_pos (run as root) [00:03] [FAIL]
functional/inuse/inuse_007_pos (run as root) [00:02] [FAIL]
functional/inuse/inuse_008_pos (run as root) [00:02] [FAIL]
functional/inuse/inuse_009_pos (run as root) [00:02] [FAIL]
functional/migration/setup (run as root) [00:01] [FAIL]
functional/no_space/setup (run as root) [00:01] [FAIL]
functional/persist_l2arc/persist_l2arc_008_pos (run as root) [00:48] [FAIL]
functional/pool_checkpoint/setup (run as root) [01:50] [FAIL]
functional/pool_names/pool_names_001_pos (run as root) [00:01] [FAIL]
functional/poolversion/setup (run as root) [00:01] [FAIL]
functional/poolversion/cleanup (run as root) [00:01] [FAIL]
functional/projectquota/projectquota_003_pos (run as root) [00:01] [FAIL]
functional/refreserv/refreserv_004_pos (run as root) [00:04] [FAIL]
functional/removal/removal_with_add (run as root) [00:34] [FAIL]
functional/removal/remove_raidz (run as root) [00:56] [FAIL]
functional/rsend/rsend_008_pos (run as root) [00:02] [FAIL]
functional/scrub_mirror/setup (run as root) [00:01] [FAIL]
functional/slog/slog_014_pos (run as root) [00:40] [FAIL]
functional/vdev_zaps/vdev_zaps_007_pos (run as root) [00:08] [FAIL]
functional/write_dirs/setup (run as root) [00:01] [FAIL]
Of particular importance, given where this change is:
Test: /opt/zfs-tests/tests/functional/upgrade/setup (run as root) [00:00] [PASS]
Test: /opt/zfs-tests/tests/functional/upgrade/upgrade_userobj_001_pos (run as root) [00:02] [PASS]
Test: /opt/zfs-tests/tests/functional/upgrade/upgrade_projectquota_001_pos (run as root) [00:13] [PASS]
Test: /opt/zfs-tests/tests/functional/upgrade/upgrade_readonly_pool (run as root) [00:01] [PASS]
Test: /opt/zfs-tests/tests/functional/upgrade/cleanup (run as root) [00:00] [PASS]
A number of these test failures are caused by #13324 which was not fixed in gate at the time the tests were run.
The others were failing (at least intermittently) both before and after this change.
Updated by Alex Wilson almost 3 years ago
I cherry-picked cr 1087 onto omnios r151034 to test it on one of our machines which was very reliable at hitting the panic originally (would almost always hit it shortly after boot). So far it hasn't panic'd, so I think this may be a success.
Updated by Electric Monk almost 3 years ago
- Status changed from New to Closed
- % Done changed from 0 to 100
git commit a39df08cf5d40d52f3f54886012efca9796f23c8
commit a39df08cf5d40d52f3f54886012efca9796f23c8
Author: Arkadiusz Bubała <arkadiusz.bubala@open-e.com>
Date:   2020-12-04T16:42:10.000Z

    13194 null/dangling pointer deref somewhere under dmu_objset_upgrade
    Portions contributed by: Andy Fiddaman <andy@omniosce.org>
    Reviewed by: Jason King <jason.king@joyent.com>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Reviewed by: Jerry Jelinek <gjelinek@gmail.com>
    Approved by: Robert Mustacchi <rm@fingolfin.org>
Updated by Andy Fiddaman almost 3 years ago
- Related to Bug #13358: dmu_objset_upgrade_stop() needs to wait added