Bug #883


ZIL reuse during remount can lead to data corruption

Added by Eric Schrock about 12 years ago. Updated about 12 years ago.

Start date:
Due date:
% Done:


Estimated time:
Gerrit CR:
External Bug:


In the ZIL, each objset is associated with a zilog_t structure. The zilog keeps
track of, among other things, the log write buffers (LWBs) of outstanding log
records. These reflect on-disk ZIL records, and are cleaned up whenever a txg
is synced or when the dataset is unmounted.

The general lifecycle of a zilog looks like this:

[ zil_lwb_write_start() / zil_sync() ] loop

The distinction between zil_alloc()/zil_open() and zil_close()/zil_free() is
important. We call zil_alloc() when an objset is opened, but
zil_open()/zil_replay() is only called when the dataset is mounted. Similarly,
we call zil_close() when the dataset is unmounted, but zil_free() when the
dataset is evicted (via the last reference going away). The implications of
this is that if there are active holds on a dataset, it's possible to call
zil_open()/zil_replay()/zil_close() multiple times without calling
zil_alloc()/zil_free(), because the dataset is never evicted during the remount

The ZIL maintains a special "stub" block that represents the end of the list.
In zil_close(), we free any active LWBs, but leave one LWB for the stub block.
This is cleaned up by zil_free():

 * After zil_close() there should only be one lwb with a buffer.
head_lwb = list_head(&zilog->zl_lwb_list);
if (head_lwb) {
ASSERT(head_lwb list_tail(&zilog->zl_lwb_list));
list_remove(&zilog->zl_lwb_list, head_lwb);
zio_buf_free(head_lwb->lwb_buf, head_lwb->lwb_sz);
kmem_cache_free(zil_lwb_cache, head_lwb);

This separation of cleanup duties can lead to data corruption in the events of a
remount while the dataset is held. What happens is that we come through
zil_close(), leaving this lingering LWB on the list. Instead of going through
zil_free(), we instead go through zil_replay() and then zil_destroy(), which
does the following:

if (!list_is_empty(&zilog->zl_lwb_list)) {
ASSERT(zh->zh_claim_txg 0);
while ((lwb = list_head(&zilog->zl_lwb_list)) != NULL) {
list_remove(&zilog->zl_lwb_list, lwb);
if (lwb->lwb_buf != NULL)
zio_buf_free(lwb->lwb_buf, lwb->lwb_sz);
zio_free_zil(zilog->zl_spa, txg, &lwb->lwb_blk);
kmem_cache_free(zil_lwb_cache, lwb);
} else if (!keep_first) {
(void) zil_parse(zilog, zil_free_log_block,
zil_free_log_record, tx, zh->zh_claim_txg);

Normally, this is called with an empty LWB list and 'keep_first = B_TRUE'. But
because of this bug, we instead go through the LWB cleanup code. On a DEBUG
kernel, we'll blow up on the "ASSERT" assertion, but for non-DEBUG,
we'll happily call zio_free_zil() on the stub block. When we later call
zil_create(), we will re-create this LWB based on the ZIL header block (zh_log).
At this point, we will have an active reference to a freed block, after which
point all kinds of things can go wrong. The most common manifestation is that
we later free the block again (perhaps by going through the above sequence
again), after which we'll end up with two free records for the same block in the
space map.

Once this happens, the pool is corrupted and won't be openable. It is
theoretically possible to rebuild the spacemap after this happens (collapsing
overlapping records) to repair such pools, but this is a non-trivial fix.

Moving the zil_free() cleanup to zil_close() prevents this problem from
occurring in the first place. The fix will also include some assertions to make
sure the LWB list is clear, and ztest changes to induce this failure mode.

I don't know how to assign myself as the RE for this bug, but I'll have the fix
ready shortly.

Related issues

Related to illumos gate - Bug #1175: ZFS rpool boot fails on oi148Feedback2011-07-01

Actions #1

Updated by Albert Lee about 12 years ago

  • Assignee set to Eric Schrock

Thanks, Eric!

Actions #2

Updated by Eric Schrock about 12 years ago

  • Status changed from New to Resolved
  • Difficulty set to Medium
  • Tags set to needs-triage

changeset: 13380:161b964a0e10
tag: tip
user: Eric Schrock <Eric.Schrock at>
date: Tue May 31 13:48:11 2011 -0700

883 ZIL reuse during remount can lead to data corruption
Reviewed by: Matt Ahrens <Matt.Ahrens at>
Reviewed by: Adam Leventhal <Adam.Leventhal at>
Reviewed by: Albert Lee <trisk at>
Reviewed by: Gordon Ross <gwr at>
Reviewed by: Garrett D'Amore <garrett at>
Reivewed by: Dan McDonald <danmcd at>
Approved by: Gordon Ross <gwr at>


Also available in: Atom PDF