Bug #6537
closed
Panic on zpool scrub with DEBUG kernel
100%
Description
On my T2000 (SPARC sun4v) running oi_151a8, I built a recent version of illumos and did an ONU. This procedure installed a DEBUG kernel. It ran normally, but hit a panic and core dump when the scrub ran from the root crontab. The root pool was created during installation of oi_151a8. Here's my analysis of the dump:
> ::stack
dsl_scan_sync+0x424(30067b40ac0, 3007b950600, 3007b950600, 7baebbf8, 7baebbf8, 7bae5dc0)
spa_sync+0x250(30063b29000, 3fbe6b, 30063b29440, 30063b29410, 19d7120, 30069527440)
txg_sync_thread+0x204(30067b40ac0, 7baf0800, 7baf0940, 7baf0728, 30067b40c80, 19d6470)
thread_start+4(30067b40ac0, 0, 0, 0, 0, 0)
> ::panicinfo
     cpu 1
  thread 2a1014b3c80
 message BAD TRAP: type=31 rp=2a1014b3720 addr=58 mmu_fsr=0 occurred in module "zfs" due to a NULL pointer dereference
  tstate 44e2001601
      g1 0
      g2 1387
      g3 0
      g4 1
      g5 18
      g6 10
      g7 2a1014b3c80
      o0 0
      o1 0
      o2 d555554
      o3 0
      o4 9bd22400
      o5 0
      o6 2a1014b2fc1
      o7 7ba50dcc
      pc 7ba51164
     npc 7ba51168
       y 0
    sfsr 0
    sfar 58
      tt 31
> dsl_scan_sync+0x424::dis
dsl_scan_sync+0x3fc: ldx [%g1 + 0x30], %o3
dsl_scan_sync+0x400: neg %o4
dsl_scan_sync+0x404: neg %o2
dsl_scan_sync+0x408: call -0x17f08 <dsl_dir_diduse_space>
dsl_scan_sync+0x40c: neg %o3
dsl_scan_sync+0x410: ld [%l0 + 0x34], %g1
dsl_scan_sync+0x414: cmp %g1, 0x0
dsl_scan_sync+0x418: bne,a,pn %icc, -0x254 <dsl_scan_sync+0x1c4>
dsl_scan_sync+0x41c: ldx [%l0 + 0x50], %g1
dsl_scan_sync+0x420: ldx [%i0 + 0x20], %g1
dsl_scan_sync+0x424: ldx [%g1 + 0x58], %g1
dsl_scan_sync+0x428: ldx [%g1 + 0x18], %g1
dsl_scan_sync+0x42c: ldx [%g1 + 0x28], %o1
dsl_scan_sync+0x430: brnz,pn %o1, +0x50 <dsl_scan_sync+0x480>
dsl_scan_sync+0x434: sethi %hi(0x7bae8400), %o0
dsl_scan_sync+0x438: ldx [%g1 + 0x30], %o1
dsl_scan_sync+0x43c: brnz,pn %o1, +0x10 <dsl_scan_sync+0x44c>
dsl_scan_sync+0x440: sethi %hi(0x7bae8400), %o0
dsl_scan_sync+0x444: ba,pt %xcc, -0x28c <dsl_scan_sync+0x1b8>
dsl_scan_sync+0x448: ldx [%g1 + 0x38], %o1
dsl_scan_sync+0x44c: sethi %hi(0x7bae7400), %o2
> ^D
The disassembly corresponds to this piece of source in dsl_scan.c:
		dsl_dir_diduse_space(dp->dp_free_dir, DD_USED_HEAD,
		    -dsl_dir_phys(dp->dp_free_dir)->dd_used_bytes,
		    -dsl_dir_phys(dp->dp_free_dir)->dd_compressed_bytes,
		    -dsl_dir_phys(dp->dp_free_dir)->dd_uncompressed_bytes, tx);
	}
	if (!scn->scn_async_destroying) {
		/* finished; verify that space accounting went to zero */
		ASSERT0(dsl_dir_phys(dp->dp_free_dir)->dd_used_bytes);
		ASSERT0(dsl_dir_phys(dp->dp_free_dir)->dd_compressed_bytes);
		ASSERT0(dsl_dir_phys(dp->dp_free_dir)->dd_uncompressed_bytes);
	}
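The trap occurs at dsl_scan_sync+0x424, a load from [%g1 + 0x58] with %g1 equal to zero, which matches the fault address of 0x58. It seems to be dereferencing a NULL pointer while evaluating the first ASSERT0 macro. This change to dsl_scan.c resolves the problem: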
 		    -dsl_dir_phys(dp->dp_free_dir)->dd_compressed_bytes,
 		    -dsl_dir_phys(dp->dp_free_dir)->dd_uncompressed_bytes, tx);
 	}
-	if (!scn->scn_async_destroying) {
+	if (dp->dp_free_dir != NULL && !scn->scn_async_destroying) {
 		/* finished; verify that space accounting went to zero */
 		ASSERT0(dsl_dir_phys(dp->dp_free_dir)->dd_used_bytes);
 		ASSERT0(dsl_dir_phys(dp->dp_free_dir)->dd_compressed_bytes);
The error could have been missed in testing because it only affects the DEBUG kernel. I don't know what caused that pointer to become NULL. My systems run a scrub once a night.
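To illustrate why only a DEBUG kernel would trip over this, here is a minimal user-space sketch of an assertion macro that disappears in a non-debug build. The names SKETCH_DEBUG, SKETCH_ASSERT0, checked_value and run_check are hypothetical and are not the illumos definitions; the sketch only assumes that ASSERT0 behaves in a similar way, expanding to nothing when DEBUG is not set.

```c
#include <assert.h>

/* Counts how many times the asserted expression was actually evaluated. */
int evaluations = 0;

/* Stand-in for the expression inside the assertion. */
int
checked_value(void)
{
	evaluations++;
	return (0);
}

/*
 * Hypothetical macro: checks the value in a debug build and expands to
 * nothing otherwise, so its argument is never evaluated.
 */
#ifdef SKETCH_DEBUG
#define	SKETCH_ASSERT0(x)	assert((x) == 0)
#else
#define	SKETCH_ASSERT0(x)	((void)0)
#endif

/* Returns the number of evaluations performed by the assertion. */
int
run_check(void)
{
	SKETCH_ASSERT0(checked_value());
	return (evaluations);
}
```

Built without SKETCH_DEBUG, run_check() never calls checked_value(), so a bad pointer inside the asserted expression would go unnoticed. Built with -DSKETCH_DEBUG, the expression is evaluated and any fault in it surfaces.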
Updated by Gary Mills about 7 years ago
I now know what caused the pointer to be NULL. The scrub that caused the kernel panic was on a data pool that I had deliberately created with version 15. With such an old version, dp_free_dir is always NULL because the directory does not exist. The fix above correctly resolves the problem.
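The shape of the fix can be sketched in plain user-space C. The types sketch_pool_t, sketch_dir_t and sketch_dir_phys_t and the function sketch_accounting_ok are hypothetical stand-ins, not the real ZFS structures; the sketch only assumes that the free directory pointer is NULL on an old pool and must be tested before anything is read through it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily reduced stand-ins for the ZFS structures. */
typedef struct sketch_dir_phys {
	uint64_t used_bytes;
} sketch_dir_phys_t;

typedef struct sketch_dir {
	sketch_dir_phys_t *phys;
} sketch_dir_t;

typedef struct sketch_pool {
	sketch_dir_t *free_dir;	/* NULL when the pool has no free directory */
} sketch_pool_t;

/*
 * Returns 1 when the accounting check passes or does not apply,
 * and 0 when the free directory still reports used space.
 */
int
sketch_accounting_ok(const sketch_pool_t *pool, int async_destroying)
{
	/* Test the pointer first, so an old pool skips the check entirely. */
	if (pool->free_dir == NULL || async_destroying)
		return (1);
	return (pool->free_dir->phys->used_bytes == 0);
}
```

A pool initialised as `sketch_pool_t old_pool = { NULL };` passes through sketch_accounting_ok(&old_pool, 0) without touching the missing directory, which is the behaviour the guarded condition in the patch provides.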
Updated by Electric Monk about 7 years ago
- Status changed from New to Closed
- % Done changed from 0 to 100
git commit 8c04a1fa3f7d569d48fe9b5342d0bd4c533179b9
commit 8c04a1fa3f7d569d48fe9b5342d0bd4c533179b9
Author: Gary Mills <gary_mills@fastmail.fm>
Date: 2016-01-13T21:11:38.000Z

    6537 Panic on zpool scrub with DEBUG kernel
    Reviewed by: Steve Gonczi <gonczi@comcast.net>
    Reviewed by: Dan McDonald <danmcd@omniti.com>
    Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Approved by: Matthew Ahrens <mahrens@delphix.com>