zpools going south
There have been several reports of zpools that could no longer be imported; we also had one such incident here. The common pattern is corrupt metadata on disk that is nevertheless covered by a valid checksum.
We have been able to catch such a corruption with a debug kernel, before it got written to disk:
vpanic()
0xfffffffff781fe15()
arc_buf_thaw+0x42(ffffff24b4a598c0)
arc_release+0x3bf(ffffff24b4a598c0, ffffffd578e50318)
dbuf_release_bp+0x6a(ffffffd578e50318)
dbuf_write+0x98(ffffff17de5c5550, ffffff24b4a598c0, ffffffd6d8395c00)
dbuf_sync_indirect+0x144(ffffff17de5c5550, ffffffd6d8395c00)
dbuf_sync_list+0xac(ffffff122e296a78, 1, ffffffd6d8395c00)
dnode_sync+0x46c(ffffff122e296970, ffffffd6d8395c00)
dmu_objset_sync_dnodes+0x93(ffffff11fe4f7e50, 0, ffffffd6d8395c00)
dmu_objset_sync+0x1d6(ffffff11fe4f7c40, fffffff70e8af388, ffffffd6d8395c00)
dsl_pool_sync_mos+0x42(ffffff12135f0a00, ffffffd6d8395c00)
dsl_pool_sync+0x2fe(ffffff12135f0a00, 23d59d)
spa_sync+0x27e(ffffff123b2d8000, 23d59d)
txg_sync_thread+0x260(ffffff12135f0a00)
thread_start+8()

panic message: "buffer modified while frozen"
Note that the last-minute check in arc_cksum_verify that saves us here is only present with ZFS_DEBUG_MODIFY turned on.
From the crashdump I have been able to reconstruct the following chain of events.
- We start with an arc_buf_hdr_t in state mru.
- The buffer gets picked by l2arc_write_buffers to get written to l2arc
- In the first stage of l2arc_write_buffers, the buffer is fully locked (hash_lock and l2ad_mtx held) and marked to be written to l2arc by flagging it with ARC_FLAG_L2_WRITING. The l2hdr is initialized with compression set to OFF and b_asize set to the buffer's size.
- In the second stage, the hash_lock is given up. From this point on, we're only allowed to touch l2hdr fields, as two comments in l2arc_write_buffers clearly state:
 * We shouldn't need to lock the buffer here, since we flagged
 * it as ARC_FLAG_L2_WRITING in the previous step, but we must
 * take care to only access its L2 cache parameters. In
 * Note, we can't use a new flag to distinguish
 * the two stages because we don't hold the
 * header's hash_lock below, in the second stage
 * of this function. Thus, we can't simply
 * change the b_flags field to denote that the
 * IO has been sent. We can change the b_daddr
- Nevertheless, in the second stage, where the buffer is being compressed and sent to disk, l2arc_compress_buf is called, which may manipulate the compression flags with HDR_SET_COMPRESSION. But since the compression flags are part of the generic b_flags field and not of the l2hdr-specific fields, this access is completely unprotected!
- In the meantime the same buffer gets selected for eviction by the arc_reclaim_thread.
- arc_evict_hdr is called on this buf with hash_lock held
- arc_buf_destroy is called, which defers the free, because ARC_FLAG_L2_WRITING is set
- arc_change_state is called with new_state mru_ghost
- after that, b_flags gets manipulated:
hdr->b_flags |= ARC_FLAG_IN_HASH_TABLE;
hdr->b_flags &= ~ARC_FLAG_BUF_AVAILABLE;
- This is the point where the damage is done. This code runs in parallel with the (unprotected) manipulation of the compression flags above, and as a result the changes to the compression flags get lost. Note that the race could just as well go the other way round here, producing a completely different error pattern.
- Back to the compression thread: the buffer gets written to l2arc correctly, but the state recorded in the arc_buf_hdr is wrong: b_asize is the compressed size, while the flags state that compression is OFF.
- A while later the buffer gets read back. arc_read triggers the read from l2arc, with l2arc_read_done as completion. The size of the read is b_asize, the compressed size.
- Because the buffer is not marked as compressed, l2arc_read_done doesn't call l2arc_decompress_zio
- Thus the arc_cksum_equal fails and arcstat_l2_cksum_bad gets incremented
- There is no io_waiter, so a re-read from primary storage is triggered, using zio->io_size as the size, which is the compressed size instead of the original size. Normally l2arc_decompress_zio corrects zio->io_size back to the original size (a bit convoluted in itself, but it works as long as the flags are consistent with each other). In this case, we end up with a short read from primary storage.
- As the data is compressed on primary storage, it gets read to a temporary buffer.
- After the read, the checksum is checked against the compressed data and found to be correct.
- The data gets decompressed to the arc_buf, but with the wrong size. Thus we end up with a partial decompression, leaving the rest of the buffer uninitialized. In our case it was filled with 0xbaddcafe, but I've also seen random pieces of data here instead.
- The calling thread now releases the arc_buf, which discards the freeze_cksum. In case of ZFS_DEBUG_MODIFY, this panics because the cksum is checked first. On non-debug runs, the corruption goes unnoticed.
- The buffer is modified and written back to disk, bestowed with shiny new checksums.
- The zpool is broken.
So the root cause is the unprotected access to b_flags when setting the compression mode. This behavior changed recently with the following commit:
Author: Chris Williamson <Chris.Williamson@delphix.com>
Date:   Mon Dec 29 19:12:23 2014 -0800

    5408 managing ZFS cache devices requires lots of RAM
    Reviewed by: Christopher Siden <firstname.lastname@example.org>
    Reviewed by: George Wilson <email@example.com>
    Reviewed by: Matthew Ahrens <firstname.lastname@example.org>
    Reviewed by: Don Brady <email@example.com>
    Reviewed by: Josef 'Jeff' Sipek <firstname.lastname@example.org>
    Approved by: Garrett D'Amore <email@example.com>
which moved the l2hdr field b_compress into the generic flags to reduce the memory footprint. The solution is to reintroduce b_compress. This doesn't even cost memory: due to structure alignment there are still 4 unused bytes in l2arc_buf_hdr_t after b_asize. The fix also adds some ASSERTs as proposed by George Wilson.
George also suspected that the reports of spurious checksum errors might go back to the same cause.