Bug #13194

closed

null/dangling pointer deref somewhere under dmu_objset_upgrade

Added by Alex Wilson over 1 year ago. Updated over 1 year ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

After the integration of #11479 zfs project support, we've had machines hit panics shortly after upgrade (and re-hit similar panics after rebooting). Some machines hit it fairly reliably early on in startup, while others hit it much later while recv'ing zfs send streams.

The panic varies, but often looks like:

> ::status
debugging crash dump vmcore.0 (64-bit) from 24-6e-96-31-df-4c
operating system: 5.11 joyent_20190919T022916Z (i86pc)
git branch: master
git rev: 945e7932e079324beb3dba245ce6bc8f401695da
image uuid: (not set)
panic message: BAD TRAP: type=e (#pf Page fault) rp=fffffbe364161a20 addr=0 occurred in module "<unknown>" due to a NULL pointer dereference
dump content: kernel pages only

> $C
fffffbe364161c00 taskq_thread+0x2d0(fffffdf4c59fd198)
fffffbe364161c10 thread_start+0xb()

Some stacks also involve dmu_objset_upgrade_task_cb. The crash dump following the panic usually succeeds.

We haven't been able to reproduce this on any non-production systems, which makes debugging more challenging. We have also observed permanent data corruption in ZFS pools following this panic on two machines, so we are very loath to reproduce it further in production. I'll add my initial analysis of one of the dumps as a comment.

When we clobber the upgrade process that was introduced for #11479 by adding return; at the first line of dmu_objset_upgrade, these panics do not occur and our systems run as normal. We have patched all of our production illumos systems in this way since late last year to avoid this panic.
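For readers unfamiliar with the workaround, here is a minimal standalone sketch of its shape. It is not the kernel source: the objset_t layout and the upgrade_dispatched flag are hypothetical stand-ins, the function signature is an assumption modeled on OpenZFS, and the statements after the return are only a placeholder for the real body.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's objset_t (not the real layout). */
typedef struct objset {
	bool upgrade_dispatched;	/* illustrative flag only */
} objset_t;

/* Callback type; assumed here to take the objset and return an errno. */
typedef int (*dmu_objset_upgrade_cb_t)(objset_t *);

void
dmu_objset_upgrade(objset_t *os, dmu_objset_upgrade_cb_t cb)
{
	return;		/* workaround: skip the upgrade entirely */

	/*
	 * Placeholder for the original body, which would schedule the
	 * upgrade callback. It is unreachable once the return is added.
	 */
	(void) cb;
	os->upgrade_dispatched = true;
}
```

With the early return in place, calling the function leaves the objset untouched, so no upgrade work is ever queued for that dataset.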


Related issues

Related to illumos gate - Bug #11479: zfs project support (Closed, Jerry Jelinek)
Related to illumos gate - Bug #13358: dmu_objset_upgrade_stop() needs to wait (Closed, Jason King)