
Bug #10445

ZFS assertion failed in dsl_scan.c:1697 on snapshot destruction

Added by Hans Rosenfeld 5 months ago. Updated 5 months ago.

Status:
New
Priority:
High
Assignee:
-
Category:
zfs - Zettabyte File System
Start date:
2019-02-24
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage

Description

I just switched one of my OI machines to DEBUG bits to do some kernel work there. A few minutes after booting, time-slider destroyed an old snapshot, which triggered this assertion failure. While I was typing this it happened again, so it's easily reproducible.

panic[cpu1]/thread=ffffd000402d2c40: assertion failed: dsl_dir_phys(dp->dp_free_dir)->dd_used_bytes == 0 (0x14000 == 0x0), file: ../../common/fs/zfs/dsl_scan.c, line: 1697
> $C
ffffd000402d2970 vpanic()
ffffd000402d29c0 0xfffffffffbe0c59d()
ffffd000402d2a20 dsl_process_async_destroys+0x4da(ffffd00d5fa4a600, ffffd00df0047a00)
ffffd000402d2a80 dsl_scan_sync+0x116(ffffd00d5fa4a600, ffffd00df0047a00)
ffffd000402d2b50 spa_sync+0x533(ffffd00d5ac7e000, 236c191)
ffffd000402d2c20 txg_sync_thread+0x297(ffffd00d5fa4a600)
ffffd000402d2c30 thread_start+8()
> ::ps ! grep zfs
R 103615 101329 101329 101329      0 0x5a006000 ffffd00ebc451040 zfs
> ffffd00ebc451040::ps -f
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R 103615 101329 101329 101329      0 0x5a006000 ffffd00ebc451040 
/usr/sbin/zfs destroy -d export/home/woodstoc@zfs-auto-snap_frequent-2019-02-24

History

#1

Updated by Igor Kozhukhov 5 months ago

I see the same issue on DilOS on a DEBUG build when destroying an older BE.
I can't reproduce this issue with the ZFS tests (about 1200 of them).
I thought it was related to my updates from ZoL, but right now I think it may be related to the 'large dnode' feature after its integration.

#2

Updated by Toomas Soome 5 months ago

Igor Kozhukhov wrote:

I see the same issue on DilOS on a DEBUG build when destroying an older BE.
I can't reproduce this issue with the ZFS tests (about 1200 of them).
I thought it was related to my updates from ZoL, but right now I think it may be related to the 'large dnode' feature after its integration.

I did just try out the snapshot destroy via beadm destroy:

root@uefi-oi:/home/tsoome# beadm list
BE Active Mountpoint Space Policy Created
oi-20 - - 15,0M static 2018-10-06 10:36
oi-21 - - 14,2M static 2018-11-10 13:12
oi-22 - - 14,4M static 2018-11-18 16:48
oi-23 - - 14,7M static 2018-12-02 11:10
oi-24 - - 14,9M static 2018-12-13 09:01
oi-25 - - 13,8M static 2019-01-08 12:41
oi-26 - - 14,5M static 2019-01-09 17:51
oi-27 - - 14,2M static 2019-01-20 12:25
oi-28 - - 14,1M static 2019-02-17 19:04
oi-29 NR / 7,70G static 2019-02-24 18:44
root@uefi-oi:/home/tsoome#

And I did beadm destroy them all without a panic. That VM was updated to a DEBUG build before this test, which of course only hints that there must be more context.

#3

Updated by Hans Rosenfeld 5 months ago

I've tried going back to an earlier BE running illumos-fb2612809e, and the panic happens there too.

#4

Updated by Hans Rosenfeld 5 months ago

Igor reported an interesting panic that looks close:

panic[cpu3]/thread=fffffe000fd2ec20: assertion failed: dsl_dir_phys(dp->dp_free_dir)->dd_used_bytes == 0 (0xffffffffff9b1000 == 0x0), file: ../../common/fs/zfs/dsl_scan.c, line: 3219

fffffe000fd2e990 genunix:process_type+193699 ()
fffffe000fd2e9f0 zfs:dsl_process_async_destroys+49a ()
fffffe000fd2ea70 zfs:dsl_scan_sync+1cb ()
fffffe000fd2eb50 zfs:spa_sync+603 ()
fffffe000fd2ec00 zfs:txg_sync_thread+2a7 ()
fffffe000fd2ec10 unix:thread_start+8 ()

The value for dd_used_bytes looks particularly interesting: as an unsigned 64-bit quantity, 0xffffffffff9b1000 is what you get when the counter is decremented below zero and wraps around. I should add that Igor runs with a load of patches backported from everywhere that only he knows about, so it's hard to say whether this is truly relevant to this bug.
