Deleting blocks on (deduped) pool consumes all RAM and freezes the system
As conveyed in my bug #956, some of my OpenSolaris Forum posts and recent mail-list activity (links below), my "zpool import" on oi_148a often takes very long to complete, presumably due to going through deferred-free blocks on a deduped pool. As Garrett said, DDT takes a prodigous amount of RAM.
In fact, after a while the system runs out of RAM regardles of configured swap areas, and hangs in an infinite loop of high scan rates (there are no free pages to find anyway - kernel holds them all and does not release, I guess). Also any zfs-related programs spawned after the import attempt, hang.
This also happens during the initial deletion of a large amount of files, of a snapshot or a dataset - processing of this delete often ultimately hangs the system and leads to a pool which is not importable for many days and many reboots (reset-button style).
Luckily, much of the progress gained during each pool import attempt (which seems to go through these deferred deletes) is cumulative, and after a week or so of intensive disk thrashing and reboots every 3 hours, my pool becomes available again.
This means that at the very least, the delete-processing can be forced (by kernel) to abort and restart instead of consuming all memory and freezing the system, and after a (long) while the block-freeing task would complete. Obviously, some cleaner and less disruptive solutions are possible, but this quick workaround should work and improve the situation.
Problem reproduction details:
Fill up several (3-4) Tb of data on a deduped pool with some datasets (preferably on a machine not deemed proper for current dedup standards - if only the public knew what the required specs are!), create some snapshots along the way.
Deleting the snapshots and/or datasets should lead to system hanging. (Actually the point is it shouldn't, but in terms of reproducing the bug ;) ...)
Deleted dataset sizes are approx 500Gb, on a raidz2 pool of 6 HDDs with DDT, P4 (64-bit DualCore) CPU, 8Gb RAM, no SSD. The incidents occurred both with the physical pool and with a pool available via an iSCSI device backed by a volume in the physical pool.Links to external detailed info:
Updated by Jim Klimov over 8 years ago
For an illustration, here's a diff of two "zdb -bsvL" runs, taken before and after one of many failed zpool import attempts. As can be seen below, about 0.2M of 6.5M deferred-free blocks were released in this import run.
If this rate is sustained, it would take me about 30 reboots and 90 hours (it took roughly 3 hours to crash on zpool import in the past few attempts; both zdb walks took about 6 hours each), in the "best" scenario to just get this pool imported.
Obviously, I would love to have it imported faster and can wait a few hundred hours without crashes for the disk block to actually become available for new writes (with deferred-frees processed as a safe background operation), while I would be able to write into currently-available blocks already.
bu zdb* zdbwalk-bsvL-dcpol-20110611.txt Sun Jun 12 06:05:33 2011
+++ zdbwalk-bsvL-dcpol-20110613.txt Mon Jun 13 15:06:39 2011
Traversing all blocks ...
- bp count: 45586715
- bp logical: 4912471494144 avg: 107761
- bp physical: 4895044082176 avg: 107378 compression: 1.00
- bp allocated: 4903106471936 avg: 107555 compression: 1.00
- bp deduped: 698169696256 ref>1: 4479270 deduplication: 1.14
- SPA allocated: 4204936775680 used: 97.93%
+ bp count: 45360428
+ bp logical: 4886347953152 avg: 107722
+ bp physical: 4868966358016 avg: 107339 compression: 1.00
+ bp allocated: 4877002913280 avg: 107516 compression: 1.00
+ bp deduped: 672150567424 ref>1: 4310939 deduplication: 1.14
+ SPA allocated: 4204852345856 used: 97.93%
Blocks LSIZE PSIZE ASIZE avg comp %Total Type
- - - - - - - unallocated
@ -22,10 +25,10@
64 47.5K 35.5K 107K 1.66K 1.34 0.00 object array
2 32K 17.0K 51.0K 25.5K 1.88 0.00 packed nvlist
- - - - - - - packed nvlist size
- 16.9K 2.10G 360M 1.06G 63.9K 5.96 0.02 bpobj
+ 16.7K 2.07G 351M 1.03G 63.0K 6.04 0.02 bpobj
- - - - - - - bpobj header
- - - - - - - SPA space map header
- 19.2K 118M 46.7M 140M 7.31K 2.53 0.00 SPA space map
+ 20.1K 122M 49.6M 149M 7.39K 2.46 0.00 SPA space map
1 128K 128K 128K 128K 1.00 0.00 ZIL intent log
151K 2.35G 536M 1.05G 7.12K 4.50 0.02 DMU dnode
49 98.0K 26.0K 54.0K 1.10K 3.77 0.00 DMU objset
@ -36,7 +39,7@
- - - - - - - DSL dataset
- - - - - - - ZFS znode
- - - - - - - ZFS V0 ACL
- 35.6M 3.98T 3.98T 3.98T 114K 1.00 89.17 ZFS plain file
+ 35.6M 3.98T 3.98T 3.98T 114K 1.00 89.65 ZFS plain file
381K 654M 256M 512M 1.34K 2.56 0.01 ZFS directory
5 5.00K 2.50K 5.00K 1K 2.00 0.00 ZFS master node
6 29.0K 3.50K 7.00K 1.17K 8.29 0.00 ZFS delete queue
@ -59,7 +62,7@
96 142K 53.0K 106K 1.10K 2.67 0.00 ZFS user/group used
- - - - - - - ZFS user/group quota
- - - - - - - snapshot refcount tags
- 841K 3.28G 1.89G 5.67G 6.90K 1.74 0.12 DDT ZAP algorithm
+ 841K 3.29G 1.89G 5.67G 6.90K 1.74 0.12 DDT ZAP algorithm
2 32K 4K 12.0K 6.00K 8.00 0.00 DDT statistics
16 8K 8K 16K 1K 1.00 0.00 System attributes
5 2.50K 2.50K 5.00K 1K 1.00 0.00 SA master node
@ -71,12 +74,13@
- - - - - - - DSL deadlist map hdr
1 1.50K 512 1.50K 1.50K 3.00 0.00 DSL dir clones
63 7.88M 284K 851K 13.5K 28.44 0.00 bpobj subobj
- 6.53M 490G 485G 486G 74.4K 1.01 10.65 deferred free
- 387 11.3M 11.3M 11.5M 30.3K 1.00 0.00 dedup ditto
- 43.5M 4.47T 4.45T 4.46T 105K 1.00 100.00 Total
+ 6.32M 465G 461G 462G 73.1K 1.01 10.17 deferred free
+ 386 11.3M 11.3M 11.5M 30.4K 1.00 0.00 dedup ditto
+ 43.3M 4.44T 4.43T 4.44T 105K 1.00 100.00 Total
capacity operations bandwidth ---- errors ----
description used avail read write read write read write cksum
dcpool 3.82T 82.8G 30 0 1.05M 0 0 0 0/dev/dsk/c0t600144F09844CF0000004D8376AE0002d0s0 3.82T 82.8G 30 0 1.05M 0 0 0 0
-Sun Jun 12 02:47:20 MSD 2011
/dev/dsk/c0t600144F09844CF0000004D8376AE0002d0s0 3.82T 82.9G 30 0 1.04M 0 0 0 0
Updated by Jim Klimov over 8 years ago
Just in case, here's also a diff of "zdb -D" outputs before and after that import. Strangely (for me) the DDT entry sizes changed...
--- /home/jim/zdbwalk-D-dcpol-20110611.txt Sun Jun 12 06:05:33 2011
+++ /home/jim/zdbwalk-D-dcpol-20110613.txt Mon Jun 13 23:13:40 2011
@ -1,20 +1,43 @
-DDT-sha256-zap-ditto: 387 entries, size 381 on disk, 201 in core
-DDT-sha256-zap-duplicate: 4478883 entries, size 339 on disk, 213 in core
-DDT-sha256-zap-unique: 15532153 entries, size 294 on disk, 158 in core
+Tue Jun 14 00:29:41 MSD 2011
+DDT-sha256-zap-ditto: 386 entries, size 382 on disk, 201 in core
+DDT-sha256-zap-duplicate: 4310553 entries, size 345 on disk, 222 in core
+DDT-sha256-zap-unique: 15700168 entries, size 292 on disk, 156 in core
-dedup = 1.30, compress = 1.00, copies = 1.00, dedup * compress / copies = 1.30
+dedup = 1.29, compress = 1.00, copies = 1.00, dedup * compress / copies = 1.29