Project

General

Profile

Bug #883

ZIL reuse during remount can lead to data corruption

Added by Eric Schrock over 8 years ago. Updated over 8 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
kernel
Start date:
2011-04-05
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage

Description

In the ZIL, each objset is associated with a zilog_t structure. The zilog keeps
track of, among other things, the log write buffers (LWBs) of outstanding log
records. These reflect on-disk ZIL records, and are cleaned up whenever a txg
is synced or when the dataset is unmounted.

The general lifecycle of a zilog looks like this:

zil_alloc()
zil_open()
zil_replay()
zil_parse()
zil_destroy()
zil_create()
[ zil_lwb_write_start() / zil_sync() ] loop
zil_close()
zil_free()

The distinction between zil_alloc()/zil_open() and zil_close()/zil_free() is
important. We call zil_alloc() when an objset is opened, but
zil_open()/zil_replay() is only called when the dataset is mounted. Similarly,
we call zil_close() when the dataset is unmounted, but zil_free() when the
dataset is evicted (via the last reference going away). The implications of
this is that if there are active holds on a dataset, it's possible to call
zil_open()/zil_replay()/zil_close() multiple times without calling
zil_alloc()/zil_free(), because the dataset is never evicted during the remount
process.

The ZIL maintains a special "stub" block that represents the end of the list.
In zil_close(), we free any active LWBs, but leave one LWB for the stub block.
This is cleaned up by zil_free():

/*
         * After zil_close() there should only be one lwb with a buffer.
*/
head_lwb = list_head(&zilog->zl_lwb_list);
if (head_lwb) {
ASSERT(head_lwb list_tail(&zilog->zl_lwb_list));
list_remove(&zilog->zl_lwb_list, head_lwb);
zio_buf_free(head_lwb->lwb_buf, head_lwb->lwb_sz);
kmem_cache_free(zil_lwb_cache, head_lwb);
}
list_destroy(&zilog->zl_lwb_list)

This separation of cleanup duties can lead to data corruption in the events of a
remount while the dataset is held. What happens is that we come through
zil_close(), leaving this lingering LWB on the list. Instead of going through
zil_free(), we instead go through zil_replay() and then zil_destroy(), which
does the following:

if (!list_is_empty(&zilog->zl_lwb_list)) {
ASSERT(zh->zh_claim_txg 0);
ASSERT(!keep_first);
while ((lwb = list_head(&zilog->zl_lwb_list)) != NULL) {
list_remove(&zilog->zl_lwb_list, lwb);
if (lwb->lwb_buf != NULL)
zio_buf_free(lwb->lwb_buf, lwb->lwb_sz);
zio_free_zil(zilog->zl_spa, txg, &lwb->lwb_blk);
kmem_cache_free(zil_lwb_cache, lwb);
}
} else if (!keep_first) {
(void) zil_parse(zilog, zil_free_log_block,
zil_free_log_record, tx, zh->zh_claim_txg);
}

Normally, this is called with an empty LWB list and 'keep_first = B_TRUE'. But
because of this bug, we instead go through the LWB cleanup code. On a DEBUG
kernel, we'll blow up on the "ASSERT" assertion, but for non-DEBUG,
we'll happily call zio_free_zil() on the stub block. When we later call
zil_create(), we will re-create this LWB based on the ZIL header block (zh_log).
At this point, we will have an active reference to a freed block, after which
point all kinds of things can go wrong. The most common manifestation is that
we later free the block again (perhaps by going through the above sequence
again), after which we'll end up with two free records for the same block in the
space map.

Once this happens, the pool is corrupted and won't be openable. It is
theoretically possible to rebuild the spacemap after this happens (collapsing
overlapping records) to repair such pools, but this is a non-trivial fix.

Moving the zil_free() cleanup to zil_close() prevents this problem from
occurring in the first place. The fix will also include some assertions to make
sure the LWB list is clear, and ztest changes to induce this failure mode.

I don't know how to assign myself as the RE for this bug, but I'll have the fix
ready shortly.


Related issues

Related to illumos gate - Bug #1175: ZFS rpool boot fails on oi148Feedback2011-07-01

Actions

History

#1

Updated by Albert Lee over 8 years ago

  • Assignee set to Eric Schrock

Thanks, Eric!

#2

Updated by Eric Schrock over 8 years ago

  • Status changed from New to Resolved
  • Difficulty set to Medium
  • Tags set to needs-triage

changeset: 13380:161b964a0e10
tag: tip
user: Eric Schrock <Eric.Schrock at delphix.com>
date: Tue May 31 13:48:11 2011 -0700

description:
883 ZIL reuse during remount can lead to data corruption
Reviewed by: Matt Ahrens <Matt.Ahrens at delphix.com>
Reviewed by: Adam Leventhal <Adam.Leventhal at delphix.com>
Reviewed by: Albert Lee <trisk at nexenta.com>
Reviewed by: Gordon Ross <gwr at nexenta.com>
Reviewed by: Garrett D'Amore <garrett at nexenta.com>
Reivewed by: Dan McDonald <danmcd at nexenta.com>
Approved by: Gordon Ross <gwr at nexenta.com>

Also available in: Atom PDF