Project

General

Profile

Bug #956

OI_148a zfs access to a pool hangs (in kernel)

Added by Jim Klimov over 8 years ago. Updated over 8 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Start date:
2011-04-25
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage

Description

I guess I've beaten my backup machine (on which mosts of my reports were based) too hard now, to the point that many commands hang and do not return. The system usually becomes unresponsive within an hour after reboot (only reset works).

What I can gather now is that my main data pool is imported on every boot (I've disabled my "dcpool" over iSCSI, but this did not remedy the problem), filesystems are mounted and seem accessible. However even with no user activity there are some reads from the pool (ranging from 300Kb/sec to 3Mb/sec), while last week and before there were exact zeros with no user activity.

Kernel time grows until connection to the system is lost, and my assistant is by now tired of walking up to the box to press reset.

Attempts to export the pool, or to scrub it, or to set "canmount=noauto" on some datasets fail. Truss reports things like:

===== This is " truss zfs umount pool/media" end of story:
ioctl(9, MNTIOC_GETMNTANY, 0x08044DC0) = 1
open("/etc/dfs/sharetab", O_RDONLY) = 11
fstat64(11, 0x08045E00) = 0
fcntl(11, F_SETLKW, 0x08045FB0) Err#9 EBADF
brk(0x0813E000) = 0
fstat64(11, 0x08045ED0) = 0
fstat64(11, 0x08045DE0) = 0
ioctl(11, TCGETA, 0x08045E80) Err#22 EINVAL
read(11, " / e x p o r t / h o m e".., 512) = 284
read(11, 0x0811ACCC, 512) = 0
fcntl(11, F_SETLK, 0x08045FB0) = 0
llseek(11, 0, SEEK_CUR) = 284
close(11) = 0
getpid() = 1486 [1485]
door_call(12, 0x08045E60) = 0
getpid() = 1486 [1485]
door_call(12, 0x08045E60) = 0
brk(0x0814F000) = 0
door_call(7, 0x08136008) = 0
door_call(7, 0x08136008) = 0
door_call(7, 0x08136008) = 0
door_call(7, 0x08136008) = 0
stat("/pool/media", 0x08046380) (sleeping...) ===== Hangs forever after this point

On the most recent boot the boot-archive-update service also hung, and there's a hanging add_drv process as well.

"kill -9" or "kill -15" does not interrupt the hung processes, seems they are stuck in kernel,

So far I've found a similar-sounding problem with ZFS in thread http://mail.opensolaris.org/pipermail/dtrace-discuss/2008-April/005874.html

However DTrace doesn't seem to work either:
  1. dtrace -n 'profile-1000Hz{ [stack()]=count() }'
    dtrace: invalid probe specifier profile-1000Hz{
    [stack()]=count() }: "/usr/lib/dtrace/iscsit.d", line 36: ss_family is not a member of struct i_ndi_err

Similar errors happened to me on this box in single-user mode but never yet when the OS had fully booted.

That thread also mentions that the poster had similar problems when the pool was full. However, this pool reports as available at least 280Gb by "df" and 490Gb by "zpool list" counters. However, things seemed to have broken exactly when I was trying to clear up some old snapshots to recover space.

Possibly, something went bad then and ZFS is trying to recover now - but this only leads to hangs...

Is there a way to remotely trace what happens in the kernel which leads to this hang? (i.e. fix dtrace setup, maybe) For example, these reads from disks and growing kernel time percentage are intriguing to me... They were not there a couple of days ago!

Is there a way to actually enforce "zpool export -f pool" so that this export happens and the pool gets unmounted so I can try rolling back TXGs or whatever to get it consistent like it seemed to be on the weekend? I know "zfsck" before importing is too much to ask for ;)

Also, for developers: there should be some timeout watchdog which should abort attempts to use the pool, at least during export operations. At present, this situation does not let the system shutdown nor reboot properly - it seems that exporting the pool can take forever, at least several hours. This includes attempts like "reboot -qnl" which are supposed to bypass FS unmounting.

History

#1

Updated by Albert Lee over 8 years ago

  • Project changed from OpenIndiana Distribution to illumos gate
#2

Updated by Jim Klimov over 8 years ago

I keep updates on the problem in the forum posts - alas, with no answers or suggestions from other community members so far.

In short:
After updating the ZFS package to one released Mar 17, and removing the zpool.cache file, I managed to get the pool not auto-imported on start. However now the import attempts lead to disk thrashing for an hour, and within an hour all of the 8Gb RAM are consumed. No swapping begins for some reason so the system dies in a scanrate loop (about 6 million scans per second).
I tried to "zdb" the pool afterwards, and keep trying now. When traversing the blocks to validate checksums with "zdb", the system had similar behavior - however, it did begin swapping. It died after about 12 hours with similar symptoms though, but the ZDB process had almost 20Gb VM allocation.

From this I can just guess that the pool is being checked and traversed by the "zpool import" attempt, but probably as these checks are done in-kernel, they can't be swapped out?

I have not yet tried to roll back transactions as in "zpool import -F", which I very much hope would recover the pool to working state - for now I hope to get as much knowledge from ZDB and maybe provide some info to ZFS developers? Alas, the system is remote so it is unlikely I can generate a crash dump (though I'd love to know what causes the scanrate hell and renders the machine unresponsive), but other than that - I hope I can "dtrace" or "zdb" some info about the situation. Ping me on the forum to ask for info :^)

NOTE: as a workaround for the DTrace problem I moved away "/usr/lib/dtrace/iscsit.d". Not being an expert, I can't see what is wrong with this file - it seems validly writtem like many others in the directory. Probably includes some wrong file, or doesn't include what's needed?

ss_family in particular seems to be related to IP addressing structures and indeed shouldn't be found in error(?) structures...

#3

Updated by Jim Klimov over 8 years ago

Ah, I thought I had the forum post link in text above. Here's my monolog/log:

http://opensolaris.org/jive/thread.jspa?threadID=138740

#4

Updated by Jim Klimov over 8 years ago

  • Difficulty set to Medium
  • Tags set to needs-triage

Apparently, there were some incomplete deferred deletions on the pool, and each import attempt took a shot at deleting some more blocks. After a week of hangs and resets, the pool imported cleanly.

The BUG part here is the need for resets (or for my watchdog crafted to catch out-of-memory conditions and preventively reboot via uadmin calls).

"zpool import" routinely consumed all 8Gb of RAM, and did not swap out - unlike "zdb" traversals which seem to do many of the same tasks, but could swap out (left 128Mb free) and consumed about 20Gb of VM before successfully completing. On the other hand, "zpool import" led the system free memory to about 32-40Mb free, then the scan rates jumped into the millions and the system became unresponsive - boot after boot.

I believe 8Gb is not a little amount of memory, and for many home users (large part of intended enthusiast community) the motherboard won't support more RAM anyway.

So I propose that the ZFS traversal/repair code responsible for this should be reviewed, and it should either be allowed to swap out, or it should rely on smaller chunks of data, or both.

The zpool import itself may be allowed to take days in background or foreground - preferably with due warning - but it should not get the system frozen after a few minutes of importing work.

//Jim Klimov

#5

Updated by Garrett D'Amore over 8 years ago

Are you using dedup? Dedup requires a prodigious amount of RAM, or very fast disks (i.e. SSDs.)

#6

Updated by Jim Klimov over 8 years ago

Yes, as described in the links (I had little hope anyone would be able to read that much though ;) )

According to the DDT histogram, the DDT size should not be too big though. To cite myself: ===
One thing that kept me worrying was if DDT wouldn't fit in RAM and thus would exceed it. Seems not true: according to "zdb -e pool" I have about 4 million entries sized as seen below. Worst-case is on-disk size of 1.5kb yielding 6Gbytes for the whole DDT. RAM DDT image should be within 1Gb now. So it is not DDT hogging my RAM into swap-hell...

DDT-sha256-zap-ditto: 6 entries, size 8192 on disk, 1365 in core
DDT-sha256-zap-duplicate: 1923138 entries, size 1110 on disk, 179 in core
DDT-sha256-zap-unique: 2301446 entries, size 1501 on disk, 242 in core ===

Is there some other REQUIRED (non-discardable, non-swappable) overhead to using dedup other than this 1Gb of RAM for DDT (which as discussed on forums may be evicted as partof of ARC and incur lots of reads from disk - nothing so fatal)?

Anyhow, the system should be sturdier than just up and dying with no error reports or warnings...

Thanks,
//Jim

Also available in: Atom PDF