Bug #956


OI_148a zfs access to a pool hangs (in kernel)

Added by Jim Klimov over 10 years ago. Updated over 10 years ago.

I guess I've beaten my backup machine (on which most of my reports were based) too hard now, to the point that many commands hang and never return. The system usually becomes unresponsive within an hour of rebooting (only a hard reset works).

What I can gather so far: my main data pool is imported on every boot (I've disabled my "dcpool" over iSCSI, but that did not remedy the problem), its filesystems are mounted and seem accessible. However, even with no user activity there are continuous reads from the pool (ranging from 300 KB/s to 3 MB/s), whereas last week and earlier the idle read rate was exactly zero.

Kernel time grows until connection to the system is lost, and my assistant is by now tired of walking up to the box to press reset.

Attempts to export the pool, scrub it, or set "canmount=noauto" on some datasets all fail. Truss reports things like:

===== This is " truss zfs umount pool/media" end of story:
ioctl(9, MNTIOC_GETMNTANY, 0x08044DC0) = 1
open("/etc/dfs/sharetab", O_RDONLY) = 11
fstat64(11, 0x08045E00) = 0
fcntl(11, F_SETLKW, 0x08045FB0) Err#9 EBADF
brk(0x0813E000) = 0
fstat64(11, 0x08045ED0) = 0
fstat64(11, 0x08045DE0) = 0
ioctl(11, TCGETA, 0x08045E80) Err#22 EINVAL
read(11, " / e x p o r t / h o m e".., 512) = 284
read(11, 0x0811ACCC, 512) = 0
fcntl(11, F_SETLK, 0x08045FB0) = 0
llseek(11, 0, SEEK_CUR) = 284
close(11) = 0
getpid() = 1486 [1485]
door_call(12, 0x08045E60) = 0
getpid() = 1486 [1485]
door_call(12, 0x08045E60) = 0
brk(0x0814F000) = 0
door_call(7, 0x08136008) = 0
door_call(7, 0x08136008) = 0
door_call(7, 0x08136008) = 0
door_call(7, 0x08136008) = 0
stat("/pool/media", 0x08046380) (sleeping...) ===== Hangs forever after this point
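The trace ends inside stat(2) on the mountpoint, i.e. the process is blocked in the kernel inside a VFS call into ZFS and never comes back to userland. That could be confirmed from another terminal (a sketch; PID 1486 is taken from the getpid() lines in the truss output above, adjust for the actual hung process):

```shell
# Userland stack of the hung zfs(1M) process; it should end in stat64().
pstack 1486

# Kernel-side stacks of the same process' threads, via the live kernel debugger.
echo "0t1486::pid2proc | ::walk thread | ::findstack -v" | mdb -k
```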

On the most recent boot the boot-archive-update service hung as well, and there's a hanging add_drv process too.

"kill -9" or "kill -15" does not interrupt the hung processes, seems they are stuck in kernel,

So far I've found a similar-sounding problem with ZFS described in a thread.

However, DTrace doesn't seem to work either:

  # dtrace -n 'profile-1000Hz{ [stack()]=count() }'
  dtrace: invalid probe specifier profile-1000Hz{
  [stack()]=count() }: "/usr/lib/dtrace/iscsit.d", line 36: ss_family is not a member of struct i_ndi_err
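For reference, the failure here is dtrace(1M) choking while parsing the system D libraries (the broken iscsit.d translator), not the probe itself, so it might be sidestepped by not loading those libraries at all with the "nolibs" option, or by moving the offending file aside. A sketch (I believe my one-liner above was also missing the "@" aggregation sigil; both workarounds are untested assumptions on this box):

```shell
# Skip loading /usr/lib/dtrace/*.d so the broken iscsit.d translator is
# never parsed; sample kernel stacks at 997 Hz and quit after 10 seconds.
dtrace -xnolibs -n 'profile-997 { @[stack()] = count(); } tick-10s { exit(0); }'

# Alternative: move the broken translator out of the way and retry normally.
mv /usr/lib/dtrace/iscsit.d /usr/lib/dtrace/iscsit.d.bad
dtrace -n 'profile-997 { @[stack()] = count(); }'
```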

Similar errors have happened to me on this box in single-user mode, but never before once the OS had fully booted.

That thread also mentions that the poster had similar problems when the pool was full. This pool, though, reports at least 280 GB available according to "df" and 490 GB according to "zpool list". That said, things seem to have broken exactly when I was clearing out some old snapshots to recover space.

Possibly something went bad then, and ZFS is trying to recover now - but this only leads to hangs...

Is there a way to remotely trace what happens in the kernel leading up to this hang (e.g. by fixing the DTrace setup)? For example, those disk reads and the growing kernel-time percentage intrigue me - they were not there a couple of days ago!
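In the meantime, assuming mdb -k still responds even when DTrace does not, the kernel stacks of the stuck threads can be pulled straight from the live kernel; ::stacks groups threads by unique stack trace, and filtering on the zfs module should show where the pool I/O is wedged. A sketch:

```shell
# Summarize all kernel thread stacks, grouped by unique stack trace.
echo "::stacks" | mdb -k

# Only threads currently executing in the zfs module (likely the hung export/stat).
echo "::stacks -m zfs" | mdb -k

# Full detail for one suspicious thread address taken from the output above.
echo "<thread-addr>::findstack -v" | mdb -k
```

(The `<thread-addr>` placeholder stands for an actual thread pointer from the ::stacks output.)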

Is there a way to actually force "zpool export -f pool" so that the export completes and the pool gets unmounted, letting me try rolling back TXGs or whatever to return it to the consistent state it seemed to be in over the weekend? I know "zfsck" before importing is too much to ask for ;)
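If the pool can ever be exported (or the box is booted from other media so the pool is not auto-imported), illumos does have a limited equivalent of the rollback I'm asking for: "zpool import -F" discards the last few transaction groups to return to an earlier consistent state, and adding "-n" first reports whether that would succeed without modifying anything. A sketch, assuming the pool is named "pool" and is currently exported:

```shell
# Dry run: report whether discarding recent TXGs would make the pool importable.
zpool import -nF pool

# Actually discard the last few transaction groups and import the pool.
zpool import -F pool
```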

Also, for the developers: there should be a watchdog timeout that aborts attempts to use the pool, at least during export operations. At present this situation lets the system neither shut down nor reboot properly: exporting the pool can apparently take forever, at least several hours. This includes attempts like "reboot -qnl", which are supposed to bypass filesystem unmounting.
