
Bug #1305

`zfs diff` freezes entire system, forces reboot

Added by Nick Zivkovic over 8 years ago. Updated over 8 years ago.

Status: Feedback
Priority: High
Assignee: -
Category: zfs - Zettabyte File System
Start date: 2011-07-29
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags: needs-triage

Description

Doing a `zfs diff` on two snapshots within the same dataset on rpool causes the entire system to freeze. SSH access is then impossible, and no further interaction is possible until the system reboots itself or is rebooted by me.

A subsequent `zpool scrub` reveals no corruption.

It may be relevant that this dataset is nested within another dataset (i.e. rpool/ds0/ds1; I'm diffing ds1@004 and ds1@005).

Strangely, only this single dataset is affected; other datasets seem to diff just fine.

Also, diffing two specific snapshots always causes the freeze before the command emits any output, while diffing two other snapshots causes the freeze only after the command emits ~25 lines of output.

I am marking this as high priority as someone could unwittingly bring a production system down by doing a single `zfs diff`.
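For reference, a minimal sketch of the failing sequence (the dataset name is taken from the nesting example above; the exact pool layout is illustrative):

```shell
# Illustrative names; the affected dataset is nested, e.g. rpool/ds0/ds1.
zfs snapshot rpool/ds0/ds1@004
# ...modify files in the dataset...
zfs snapshot rpool/ds0/ds1@005

# On the affected dataset, this command freezes the whole system:
zfs diff rpool/ds0/ds1@004 rpool/ds0/ds1@005
```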

System specs

os: OpenIndiana 147
arch: i386
zpool ver: 28
zfs ver: 5
storage: 60GB SSD (OCZ)
dedup: enabled
compression: enabled (global)

History

#1

Updated by Garrett D'Amore over 8 years ago

  • Status changed from New to Feedback

You're using OI 147. There have been some changes in the code since then. Can you please try updating and see if it reproduces?

#2

Updated by Yuri Pankov over 8 years ago

#821 looks somewhat similar, though I wasn't able to "lock up" `zfs diff` in any situations other than those described there.

#3

Updated by Nick Zivkovic over 8 years ago

Garrett: Just managed to reproduce the same 'freeze' behaviour on an OI 148 system with 3 mirrored HDDs. All other parameters are identical, including the dataset (which was migrated to this other machine for testing, via `zfs send` + ssh).

Very obnoxious bug. Perhaps there is a way to make zfs run in user-land, so that I can dtrace it?
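(A userland port shouldn't be necessary: DTrace's fbt provider can instrument the zfs kernel module directly. A rough sketch, run as root, aggregating which zfs-module functions fire on behalf of the `zfs` command; probe and predicate names here are my assumption of a reasonable starting point, not a known-good recipe for this bug:)

```shell
# Count entries into zfs kernel-module functions made while a `zfs`
# userland process (e.g. `zfs diff`) is on CPU; prints the aggregation
# when the script is interrupted with Ctrl-C.
dtrace -n 'fbt:zfs::entry /execname == "zfs"/ { @[probefunc] = count(); }'
```

If the machine panics rather than merely hanging, the aggregation may be lost, so a crash dump (see the panic log below in this thread's later comments) is likely the more reliable artifact.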

#4

Updated by Nick Zivkovic over 8 years ago

Also, seeing as how this fatal diff error is local to a single dataset on my system, and seeing as how it affects more than just 1 system configuration, I'd be more than happy to send the dataset over to anyone via ssh (or anything capable of transferring 719MB of data).

In particular, the data set is the illumos-gate source tree with many snapshots made to ease cloning (use `zfs clone` instead of `hg clone`) and change-reversion; it is nothing confidential (like customer data) and I'm more than happy to share it, if it will enable the devs to track down this bug any faster.

#5

Updated by Michael Keller over 8 years ago

Same issue here on OI 148 while doing a `zfs diff` on zpool (rpool) version 28:

/var/adm/messages reads:

Aug  4 12:32:19 sm41 unix: [ID 836849 kern.notice] 
Aug  4 12:32:19 sm41 ^Mpanic[cpu1]/thread=ffffff0226a95040: 
Aug  4 12:32:19 sm41 genunix: [ID 335743 kern.notice] BAD TRAP: type=e (#pf Page fault) rp=ffffff000a26b6a0 addr=20 occurred in module "zfs" due to a NULL pointer dereference
Aug  4 12:32:19 sm41 unix: [ID 100000 kern.notice] 
Aug  4 12:32:19 sm41 unix: [ID 839527 kern.notice] zfs: 
Aug  4 12:32:19 sm41 unix: [ID 753105 kern.notice] #pf Page fault
Aug  4 12:32:19 sm41 unix: [ID 532287 kern.notice] Bad kernel fault at addr=0x20
Aug  4 12:32:19 sm41 unix: [ID 243837 kern.notice] pid=19214, pc=0xfffffffff7a118d8, sp=0xffffff000a26b798, eflags=0x10217
Aug  4 12:32:19 sm41 unix: [ID 211416 kern.notice] cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 426f8<osxsav,vmxe,xmme,fxsr,pge,mce,pae,pse,de>
Aug  4 12:32:19 sm41 unix: [ID 624947 kern.notice] cr2: 20
Aug  4 12:32:19 sm41 unix: [ID 625075 kern.notice] cr3: 11fe5c000
Aug  4 12:32:19 sm41 unix: [ID 625715 kern.notice] cr8: c
Aug  4 12:32:19 sm41 unix: [ID 100000 kern.notice] 
Aug  4 12:32:19 sm41 unix: [ID 592667 kern.notice]      rdi: ffffff01f45fb038 rsi:                0 rdx:                0
Aug  4 12:32:19 sm41 unix: [ID 592667 kern.notice]      rcx:         d8041e83  r8:         27fbe178  r9: ffffff000a26b800
Aug  4 12:32:19 sm41 unix: [ID 592667 kern.notice]      rax:                7 rbx:                0 rbp: ffffff000a26b7e0
Aug  4 12:32:19 sm41 unix: [ID 592667 kern.notice]      r10:                7 r11:                0 r12: ffffff01f45fb038
Aug  4 12:32:19 sm41 unix: [ID 592667 kern.notice]      r13: ffffff01f45fb038 r14: ffffff000a26b920 r15: ffffff01e5d4b140
Aug  4 12:32:19 sm41 unix: [ID 592667 kern.notice]      fsb:                0 gsb: ffffff01cd2c1b00  ds:               4b
Aug  4 12:32:19 sm41 unix: [ID 592667 kern.notice]       es:               4b  fs:                0  gs:              1c3
Aug  4 12:32:19 sm41 unix: [ID 592667 kern.notice]      trp:                e err:                0 rip: fffffffff7a118d8
Aug  4 12:32:19 sm41 unix: [ID 592667 kern.notice]       cs:               30 rfl:            10217 rsp: ffffff000a26b798
Aug  4 12:32:19 sm41 unix: [ID 266532 kern.notice]       ss:               38
Aug  4 12:32:19 sm41 unix: [ID 100000 kern.notice] 
Aug  4 12:32:19 sm41 genunix: [ID 655072 kern.notice] ffffff000a26b580 unix:die+dd ()
Aug  4 12:32:19 sm41 genunix: [ID 655072 kern.notice] ffffff000a26b690 unix:trap+1799 ()
Aug  4 12:32:19 sm41 genunix: [ID 655072 kern.notice] ffffff000a26b6a0 unix:cmntrap+e6 ()
Aug  4 12:32:19 sm41 genunix: [ID 655072 kern.notice] ffffff000a26b7e0 zfs:zap_leaf_lookup_closest+40 ()
Aug  4 12:32:19 sm41 genunix: [ID 655072 kern.notice] ffffff000a26b870 zfs:fzap_cursor_retrieve+c9 ()
Aug  4 12:32:19 sm41 genunix: [ID 655072 kern.notice] ffffff000a26b900 zfs:zap_cursor_retrieve+188 ()
Aug  4 12:32:19 sm41 genunix: [ID 655072 kern.notice] ffffff000a26b9b0 zfs:zap_value_search+7c ()
Aug  4 12:32:19 sm41 genunix: [ID 655072 kern.notice] ffffff000a26bb80 zfs:zfs_obj_to_path_impl+11d ()
Aug  4 12:32:19 sm41 genunix: [ID 655072 kern.notice] ffffff000a26bc00 zfs:zfs_obj_to_stats+ae ()
Aug  4 12:32:19 sm41 genunix: [ID 655072 kern.notice] ffffff000a26bc40 zfs:zfs_ioc_obj_to_stats+7b ()
Aug  4 12:32:19 sm41 genunix: [ID 655072 kern.notice] ffffff000a26bcc0 zfs:zfsdev_ioctl+15e ()
Aug  4 12:32:19 sm41 genunix: [ID 655072 kern.notice] ffffff000a26bd00 genunix:cdev_ioctl+45 ()
Aug  4 12:32:19 sm41 genunix: [ID 655072 kern.notice] ffffff000a26bd40 specfs:spec_ioctl+5a ()
Aug  4 12:32:19 sm41 genunix: [ID 655072 kern.notice] ffffff000a26bdc0 genunix:fop_ioctl+7b ()
Aug  4 12:32:19 sm41 genunix: [ID 655072 kern.notice] ffffff000a26bec0 genunix:ioctl+18e ()
Aug  4 12:32:19 sm41 genunix: [ID 655072 kern.notice] ffffff000a26bf10 unix:brand_sys_sysenter+1c9 ()
Aug  4 12:32:19 sm41 unix: [ID 100000 kern.notice] 
Aug  4 12:32:19 sm41 genunix: [ID 672855 kern.notice] syncing file systems...
Aug  4 12:32:19 sm41 genunix: [ID 733762 kern.notice]  13
Aug  4 12:32:20 sm41 genunix: [ID 904073 kern.notice]  done
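If savecore captured a dump after this panic, the same stack can be read back offline with mdb (the file names below assume the default /var/crash layout on the panicking host; adjust to taste):

```shell
# Open the most recent crash dump with the kernel debugger.
cd /var/crash/$(hostname)
mdb -k unix.0 vmcore.0
# Useful dcmds at the mdb prompt:
#   ::status   - panic string and dump summary
#   ::stack    - stack of the panicking thread
#   ::msgbuf   - kernel message buffer leading up to the panic
```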

#6

Updated by Michael Keller over 8 years ago

Maybe it is related to the fact that the `zfs diff` was run against the latest snapshot, while diffs against a snapshot which is not the latest one always work fine...

To clarify:

/usr/sbin/zfs list -t snapshot
NAME                                                USED  AVAIL  REFER  MOUNTPOINT
rpool/zonefs/contbuild/ROOT/zbe@build               112K      -  3.59G  -
rpool/zonefs/contbuild/ROOT/zbe@install             221K      -  3.60G  -

`zfs diff` on @install caused the panic, while `zfs diff` on @build never panics...

I tried but I can't reproduce the panic on the latest snapshot. But I can say for sure that running `zfs diff` on an earlier snapshot has NEVER panicked for me (and I'm doing it a lot)...

#7

Updated by Nick Zivkovic over 8 years ago

@Michael Keller:
Unfortunately, `zfs diff` panics on the three most recent snapshot pairs on my system, for a particular dataset.

For example, if the snapshots are like so:
pool/ds@000
pool/ds@001
pool/ds@002
pool/ds@003
pool/ds@004

The following commands cause panics:
zfs diff pool/ds@002 pool/ds@003
zfs diff pool/ds@003 pool/ds@004
zfs diff pool/ds@004 pool/ds

All commands that diff snapshots older than those three work fine.

What's more interesting is that any subsequently created snapshot (e.g. pool/ds@005) will also panic when diffed against any other snapshot or against the dataset's current state (i.e. pool/ds).
