Bug #6537

Panic on zpool scrub with DEBUG kernel

Added by Gary Mills over 3 years ago. Updated over 3 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: zfs - Zettabyte File System
Start date: 2015-12-30
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags: needs-triage

Description

On my T2000 (SPARC sun4v) running oi_151a8, I built a recent version of illumos and did an ONU, which installed a DEBUG kernel. The system ran normally until the scrub ran from the root crontab, at which point it panicked and dumped core. The root pool was created during installation of oi_151a8. Here's my analysis of the dump:

> ::stack
dsl_scan_sync+0x424(30067b40ac0, 3007b950600, 3007b950600, 7baebbf8, 7baebbf8, 
7bae5dc0)
spa_sync+0x250(30063b29000, 3fbe6b, 30063b29440, 30063b29410, 19d7120, 
30069527440)
txg_sync_thread+0x204(30067b40ac0, 7baf0800, 7baf0940, 7baf0728, 30067b40c80, 
19d6470)
thread_start+4(30067b40ac0, 0, 0, 0, 0, 0)
> ::panicinfo
             cpu                1
          thread      2a1014b3c80
         message 
BAD TRAP: type=31 rp=2a1014b3720 addr=58 mmu_fsr=0 occurred in module "zfs" due to a NULL pointer dereference
          tstate       44e2001601
              g1                0
              g2             1387
              g3                0
              g4                1
              g5               18
              g6               10
              g7      2a1014b3c80
              o0                0
              o1                0
              o2          d555554
              o3                0
              o4         9bd22400
              o5                0
              o6      2a1014b2fc1
              o7         7ba50dcc
              pc         7ba51164
             npc         7ba51168
               y                0     
            sfsr                0
            sfar               58
              tt               31
> dsl_scan_sync+0x424::dis
dsl_scan_sync+0x3fc:            ldx       [%g1 + 0x30], %o3
dsl_scan_sync+0x400:            neg       %o4
dsl_scan_sync+0x404:            neg       %o2
dsl_scan_sync+0x408:            call      -0x17f08      <dsl_dir_diduse_space>
dsl_scan_sync+0x40c:            neg       %o3
dsl_scan_sync+0x410:            ld        [%l0 + 0x34], %g1
dsl_scan_sync+0x414:            cmp       %g1, 0x0
dsl_scan_sync+0x418:            bne,a,pn  %icc, -0x254  <dsl_scan_sync+0x1c4>
dsl_scan_sync+0x41c:            ldx       [%l0 + 0x50], %g1
dsl_scan_sync+0x420:            ldx       [%i0 + 0x20], %g1
dsl_scan_sync+0x424:            ldx       [%g1 + 0x58], %g1
dsl_scan_sync+0x428:            ldx       [%g1 + 0x18], %g1
dsl_scan_sync+0x42c:            ldx       [%g1 + 0x28], %o1
dsl_scan_sync+0x430:            brnz,pn   %o1, +0x50    <dsl_scan_sync+0x480>
dsl_scan_sync+0x434:            sethi     %hi(0x7bae8400), %o0
dsl_scan_sync+0x438:            ldx       [%g1 + 0x30], %o1
dsl_scan_sync+0x43c:            brnz,pn   %o1, +0x10    <dsl_scan_sync+0x44c>
dsl_scan_sync+0x440:            sethi     %hi(0x7bae8400), %o0
dsl_scan_sync+0x444:            ba,pt     %xcc, -0x28c  <dsl_scan_sync+0x1b8>
dsl_scan_sync+0x448:            ldx       [%g1 + 0x38], %o1
dsl_scan_sync+0x44c:            sethi     %hi(0x7bae7400), %o2
>  ^D

The disassembly corresponds to this piece of source in dsl_scan.c:

        dsl_dir_diduse_space(dp->dp_free_dir, DD_USED_HEAD,
            -dsl_dir_phys(dp->dp_free_dir)->dd_used_bytes,
            -dsl_dir_phys(dp->dp_free_dir)->dd_compressed_bytes,
            -dsl_dir_phys(dp->dp_free_dir)->dd_uncompressed_bytes, tx);
    }
    if (!scn->scn_async_destroying) {
        /* finished; verify that space accounting went to zero */
        ASSERT0(dsl_dir_phys(dp->dp_free_dir)->dd_used_bytes);
        ASSERT0(dsl_dir_phys(dp->dp_free_dir)->dd_compressed_bytes);
        ASSERT0(dsl_dir_phys(dp->dp_free_dir)->dd_uncompressed_bytes);
    }
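
One way to tie the trap to the source: when the base pointer is NULL, the faulting address equals the offset of the member being loaded. The sketch below is illustrative only. The structure and function names are hypothetical and do not reproduce the real dsl_dir_t layout; only the 0x58 offset is taken from the dump (addr=58, sfar=58, and the ldx [%g1 + 0x58] at dsl_scan_sync+0x424 with g1 = 0).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical layout, NOT the real dsl_dir_t. Only the 0x58 offset
 * comes from the panic output; everything else is a placeholder.
 */
struct example_dir {
	unsigned char leading_fields[0x58];	/* stand-in for earlier members */
	void *buf;				/* member read by the faulting load */
};

/* Address that a load of `buf` would touch for a given base pointer value. */
static uintptr_t
member_load_address(uintptr_t base)
{
	return (base + offsetof(struct example_dir, buf));
}
```

With a base of 0 this yields 0x58, the same value the panic message reports as the fault address.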

It seems that dp->dp_free_dir is NULL, so the kernel dereferences a NULL pointer when the first ASSERT0 evaluates dsl_dir_phys(dp->dp_free_dir). This change to dsl_scan.c resolves the problem:

             -dsl_dir_phys(dp->dp_free_dir)->dd_compressed_bytes,
             -dsl_dir_phys(dp->dp_free_dir)->dd_uncompressed_bytes, tx);
     }
-    if (!scn->scn_async_destroying) {
+    if (dp->dp_free_dir != NULL && !scn->scn_async_destroying) {
         /* finished; verify that space accounting went to zero */
         ASSERT0(dsl_dir_phys(dp->dp_free_dir)->dd_used_bytes);
         ASSERT0(dsl_dir_phys(dp->dp_free_dir)->dd_compressed_bytes);

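The error could have been missed in testing because it only affects the DEBUG kernel: the ASSERT0 checks are compiled out of non-DEBUG builds, so the NULL pointer is never dereferenced there. I don't know what caused that pointer to become NULL. My systems run a scrub once a night.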
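
The guard in the patch can be modelled outside the kernel as follows. This is a minimal sketch with hypothetical types and names (struct pool, check_free_accounting and the enum are not illumos code); it only shows that testing the pointer first lets && short-circuit, so nothing is dereferenced when the directory is absent.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the on-disk accounting and the pool. */
struct free_dir_phys {
	uint64_t used_bytes;
	uint64_t compressed_bytes;
	uint64_t uncompressed_bytes;
};

struct pool {
	struct free_dir_phys *free_dir;	/* NULL when the pool has no free directory */
};

enum check_result { CHECK_SKIPPED, CHECK_OK, CHECK_NONZERO };

/*
 * Verify that the accounting went to zero, but only when the directory
 * exists and no asynchronous destroy is in progress.
 */
static enum check_result
check_free_accounting(const struct pool *p, int async_destroying)
{
	/* Pointer test comes first, so a NULL free_dir is never dereferenced. */
	if (p->free_dir != NULL && !async_destroying) {
		const struct free_dir_phys *phys = p->free_dir;

		if (phys->used_bytes == 0 && phys->compressed_bytes == 0 &&
		    phys->uncompressed_bytes == 0)
			return (CHECK_OK);
		return (CHECK_NONZERO);
	}
	return (CHECK_SKIPPED);
}
```

In this model a pool whose free_dir is NULL returns CHECK_SKIPPED instead of faulting, which mirrors the behaviour the one-line change gives the DEBUG kernel.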

History

#1

Updated by Gary Mills over 3 years ago

I now know what caused the pointer to be NULL. The scrub that caused the kernel panic was on a data pool that I had deliberately created with version 15. With such an old version, dp_free_dir is always NULL because the directory does not exist. The fix above correctly resolves the problem.

#2

Updated by Electric Monk over 3 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 8c04a1fa3f7d569d48fe9b5342d0bd4c533179b9

commit  8c04a1fa3f7d569d48fe9b5342d0bd4c533179b9
Author: Gary Mills <gary_mills@fastmail.fm>
Date:   2016-01-13T21:11:38.000Z

    6537 Panic on zpool scrub with DEBUG kernel
    Reviewed by: Steve Gonczi <gonczi@comcast.net>
    Reviewed by: Dan McDonald <danmcd@omniti.com>
    Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Approved by: Matthew Ahrens <mahrens@delphix.com>
