OI_148a locks up on any problems with ZFS pools
I have a 6-spindle raidz2 pool named "pool" on commodity hardware. Over that pool I've made a zvol which is exported and re-imported as an iSCSI device, and over that iSCSI device the same system creates another ZFS pool "dcpool".
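For reference, the layered setup described above can be sketched with standard illumos commands roughly as follows. This is an illustrative reconstruction, not my actual command history: the zvol name, size and the LU GUID handling are assumptions (the GUID shown is the one that later appears in the pool's device name).

```shell
# Create a compressed zvol inside the physical raidz2 pool
# (name "pool/dcvol" and size are hypothetical).
zfs create -V 500G -o compression=on pool/dcvol

# Export the zvol over iSCSI via COMSTAR (target side).
svcadm enable -r svc:/network/iscsi/target:default
stmfadm create-lu /dev/zvol/rdsk/pool/dcvol
stmfadm add-view 600144F09844CF0000004D8376AE0002   # LU GUID printed by create-lu
itadm create-target

# Re-import it on the same host over loopback (initiator side).
svcadm enable -r svc:/network/iscsi/initiator:default
iscsiadm modify discovery --sendtargets enable
iscsiadm add discovery-address 127.0.0.1

# Build the second pool on the resulting iSCSI disk.
zpool create -O dedup=on dcpool c0t600144F09844CF0000004D8376AE0002d0
```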
Occasionally the system times out on some accesses to either pool (at least once due to a flaky SATA connector) and locks up completely - the X11 session is unresponsive, I can't SSH into the machine, and it does not respond to the power button, so I have to reset it.
I expected the ZFS pool to continue working (albeit slowly) in such a case, and perhaps trigger a resilver - even if one SATA connection is lost, the raidz2 pool without one disk still has an extra disk's worth of parity data.
Also, on some timeouts (e.g. when the system is too busy processing data) the "dcpool" iSCSI localhost connection gets dropped, so the underlying device becomes UNAVAIL. This also causes a system-wide lockup. I had rather hoped that the system would just fail the outstanding IOs (as with a USB flash stick being yanked out) or wait for the device (the iSCSI connection) to come back (as with an NFS server reboot).
Instead, this system can hang for hours until it is reset. Which makes me nervous, at least. And it's not as reliable as I'd hoped.
Some details regarding the "dcpool":
To be precise, the zvol is compressed in "pool", and for all datasets in "dcpool" I have enabled deduplication, so the system gets through write cycles faster: instead of 'compress + dedup + dispose of common blocks' it can go 'dedup + dispose of common blocks + compress unique blocks'.
Using an iSCSI loopback, instead of making "dcpool" directly in the zvol, was suggested by Darren Moffat.
When the system boots up, "dcpool" is deemed "UNAVAIL" (it was imported earlier and remains cached via zpool.cache). However, after "iscsi/target" and "iscsi/initiator" start up, it is automatically found, imported and mounted by the time the X11 session logs in.
I had to tweak the SMF service properties - initiator now depends on target, so that they start in proper order for this setup.
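The SMF tweak above can be sketched like this; the dependency property-group name "local_target" is my own placeholder, not necessarily what I used back then:

```shell
# Make the iSCSI initiator depend on the local iSCSI target service,
# so the loopback LUN exists before the initiator probes for it.
svccfg -s svc:/network/iscsi/initiator:default <<'EOF'
addpg local_target dependency
setprop local_target/grouping = astring: require_all
setprop local_target/restart_on = astring: restart
setprop local_target/type = astring: service
setprop local_target/entities = fmri: svc:/network/iscsi/target:default
EOF

# Push the new configuration into the running service.
svcadm refresh svc:/network/iscsi/initiator:default
```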
Updated by Jim Klimov over 10 years ago
I'd like to re-confirm that the problems still occur.
At the moment the same machine is configured with OI_148a and runs steadily (rsync-pulling contents of other file servers). The pool with deduping datasets is configured to use USB sticks as "zpool cache" devices for L2ARC, and apparently that works.
Once in a while (roughly once a week) a USB device gets lost. I'm not sure of the nature of the failure - mechanical or a motherboard glitch - but the USB stick no longer responds and its loss is syslogged. Sometimes it comes back after a hard reset, which is needed anyway, because the system can no longer access the ZFS pools and cannot shut down via "reboot -qfln" or anything else without manual intervention.
Note that this is one way to try and safely reproduce the bug. I've experienced identical hangs when a pool disk got disconnected from the motherboard due to flaky SATA cabling. Back then I expected the pool to become degraded but continue operation with 5 out of 6 RAIDZ2 disks.
This is all very wrong because:
1) It may take some time for anybody to come around and push reset - thus a loss of service for a while. This is PC-class hardware with no remote-reboot BMC/IPMI card.
2) AFAIK, the L2ARC devices are deemed not critical for ZFS, i.e. if a block read from L2ARC does not match its checksum, it is supposed to be re-read from original pool data. In particular for this reason caches can't be mirrored by design.
Seen behavior: when a (USB-stick) cache device is lost (and even logged as such), the ZFS pool effectively hangs until a hard reset. Calls to the "zpool" command for any action, including "zpool status", never return. No form of software reboot I attempted ever happens.
Desired behavior: L2ARC is (forcibly) removed or faulted in the pool, and ZFS operation continues with no interruption.
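Since cache devices are non-critical, ZFS already lets an administrator detach one from a live pool; the hang is surprising precisely because an equivalent automatic removal should be cheap. A sketch, using the device names from this report:

```shell
# Drop a failed L2ARC device from the pool; cache devices hold no
# unique data, so removal is safe and should not interrupt I/O.
zpool remove dcpool c2t0d0

# Confirm the cache device is gone and the pool is healthy.
zpool status dcpool
```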
Updated by Jim Klimov over 10 years ago
I've had the computer rebooted remotely, and it seems like there's indeed a problem with the USB flash used as a cache device.
Accesses to it time out, HAL service and rmvolmgr failed to start.
Importing the pool which includes the device took about 35 minutes!
- time zpool import -o cachefile=none dcpool
- dmesg | grep ehci
Apr 17 19:57:19 bofh-sol usba: [ID 691482 kern.warning] WARNING: /pci@0,0/pci1043,81ec@1d,7 (ehci1): Connecting device on port 5 failed
Apr 17 19:59:29 bofh-sol last message repeated 2 times
Apr 17 20:00:33 bofh-sol usba: [ID 691482 kern.warning] WARNING: /pci@0,0/pci1043,81ec@1d,7 (ehci1): Connecting device on port 5 failed
Apr 17 20:03:47 bofh-sol last message repeated 3 times
Apr 17 20:18:26 bofh-sol usba: [ID 691482 kern.warning] WARNING: /pci@0,0/pci1043,81ec@1d,7 (ehci1): Connecting device on port 5 failed
Apr 17 20:19:31 bofh-sol last message repeated 1 time
Apr 17 20:20:35 bofh-sol usba: [ID 691482 kern.warning] WARNING: /pci@0,0/pci1043,81ec@1d,7 (ehci1): Connecting device on port 5 failed
Apr 17 20:25:57 bofh-sol last message repeated 5 times
Apr 17 20:27:01 bofh-sol usba: [ID 691482 kern.warning] WARNING: /pci@0,0/pci1043,81ec@1d,7 (ehci1): Connecting device on port 5 failed
I believe ZFS should be (configurably?) faster about giving up on devices, especially ones as non-essential to reliability and data availability as L2ARC cache storage.
Updated by Jim Klimov over 10 years ago
Listing available pools before import showed no status, good or bad, for such extra devices:
- zpool import
action: The pool can be imported using its name or numeric identifier.
After the import succeeded (in over half an hour), one cache device is UNAVAIL:
- zpool status dcpool
scan: scrub canceled on Sun Mar 20 16:09:36 2011
NAME STATE READ WRITE CKSUM
dcpool ONLINE 0 0 0
c0t600144F09844CF0000004D8376AE0002d0 ONLINE 0 0 0
c4t1d0p6 ONLINE 0 0 0
c4t1d0p7 ONLINE 0 0 0
c2t0d0 UNAVAIL 0 0 0 cannot open
c3t0d0 ONLINE 0 0 0
errors: No known data errors
For now I have removed the USB cache devices, crippling the performance a bit; hopefully I won't see this exact problem anymore.
Apparently, this is only a workaround for the particular case of USB-flash L2ARC cache devices, and the wrong ZFS behavior applies to many other cases that are not so easily waved away... :(
Updated by Jim Klimov about 10 years ago
Hello, thanks for looking into this ;)
Alas, with almost 4 months of ZFS adventures on this box behind me, I can't responsibly say what exactly was configured back then.
Currently, and perhaps initially, pool (6-spindle) and rpool have failmode=continue, and dcpool (iscsi volume) has failmode=wait.
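The failmode settings above can be inspected and set as follows ("continue" returns EIO for new writes after a catastrophic failure, while "wait" blocks all I/O until the device returns):

```shell
# Show the current failure policy of each pool.
zpool get failmode pool dcpool rpool

# The configuration described above.
zpool set failmode=continue pool
zpool set failmode=wait dcpool
```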
The USB sticks are out of equation for several months as well.
Even if failures in "pool" or "dcpool" locked up their IOs by design, this should not keep the OS from any semblance of being alive (X11, SSH, reboot, etc.) - and rpool is a separate pool on a separate disk.
Updated by Ken Mays almost 10 years ago
- Due date set to 2011-09-14
- Category set to OS/Net (Kernel and Userland)
- Status changed from New to Feedback
- Target version set to oi_151_stable
- Estimated time set to 8.00 h
Have user image-update using latest dev-il repo and reverify issue.
Updated by Ken Mays over 8 years ago
- Status changed from Feedback to Closed
- Assignee changed from OI illumos to Ken Mays
- Target version deleted (
- % Done changed from 0 to 100
Jim - I haven't gotten any feedback after a year, and I used oi_151a7 to try to reproduce a few issues. This is not resolvable through the OI project beyond kernel updates from Illumos. Please track it through Illumos bug reports. Closing the ticket.
Updated by Jim Klimov over 8 years ago
That particular setup (with a pool in an iSCSI-loopbacked zvol in a physical zpool) has been dismantled for a while, in favor of storing the data in the physical pool directly. So I also can't confirm or disprove whether the newer kernels have the issue solved, nor how much of the blame lies with hardware flakiness (which is certainly there; this home-PC-class system is still moderately unreliable).
It is understood that if I or anyone else sees such undesired behavior in the future, in an OS with the most up-to-date illumos kernel at that time, the new bug should be posted for the illumos-gate project (and might reference this bug for history or analogies, if appropriate).