Bug #7341

Pool halt due to checksum error after reboot

Added by Arne Jansen over 3 years ago. Updated over 3 years ago.

Status:
New
Priority:
High
Assignee:
-
Category:
zfs - Zettabyte File System
Start date:
2016-08-30
Due date:
% Done:

0%

Estimated time:
Difficulty:
Expert
Tags:
needs-triage

Description

We encountered a suspended pool after reboot.

Short summary: the persistent l2arc provided an old checksum for an L0 space map block.

Blocked thread:

> ffffff007bf93c40::findstack -v
stack pointer for thread ffffff007bf93c40: ffffff007bf936c0
[ ffffff007bf936c0 _resume_from_idle+0xf4() ]
  ffffff007bf936f0 swtch+0x141()
  ffffff007bf93730 cv_wait+0x70()
  ffffff007bf93770 zio_wait+0x5b()
  ffffff007bf937e0 dbuf_read+0x2c0()
  ffffff007bf93820 dmu_buf_will_dirty+0x5c()
  ffffff007bf938d0 dmu_write+0xc1()
  ffffff007bf939a0 space_map_write+0x306()
  ffffff007bf93a40 metaslab_sync+0x10b()
  ffffff007bf93aa0 vdev_sync+0x7f()
  ffffff007bf93b70 spa_sync+0x2c3()
  ffffff007bf93c20 txg_sync_thread+0x21f()
  ffffff007bf93c30 thread_start+8()

zio:

> ffffff15b59b9c40::zio -r
ADDRESS                 TYPE  STAGE            WAITER           TIME_ELAPSED
ffffff15b59b9c40        NULL  DONE             ffffff007bf93c40 -            
 ffffff1521f00aa0       READ  FAILED           -                -            

The read failed due to a checksum error.
Expected checksum: 0x95506ab4f8, 0xd90d777c8bfd, 0xb94217234c8434, 0x74f15af293dd3aab
Actual checksum: 0x9601f5ebc2, 0xd9512bebc64e, 0xb94db62f79cd2b, 0x74f260b9c17568c5

The checksum algorithm in use is Fletcher_4, whose first word is simply the sum of all 32-bit words in the data block. The first word of the actual checksum (0x9601f5ebc2) is slightly larger than that of the expected one (0x95506ab4f8).
Because space maps are mostly append-only, I removed the last two entries from the space map block and recomputed the checksum. It then matched the expected one.
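The reasoning above can be illustrated with a minimal Fletcher-4 sketch. It assumes little-endian 32-bit words and 64-bit accumulators. The block layout (eight 64-bit slots, zero padded), the entry values, and the helper `make_block` are made up for illustration. Treating "removing" an entry as zeroing its slot is also an assumption.

```python
import struct

MASK64 = (1 << 64) - 1

def fletcher4(data: bytes):
    """Return the four 64-bit Fletcher-4 accumulators for little-endian data."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4 bytes")
    a = b = c = d = 0
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & MASK64  # first word: plain sum of all 32-bit words
        b = (b + a) & MASK64
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return a, b, c, d

def make_block(entries, slots=8):
    """Hypothetical fixed-size block: 64-bit entries followed by zero padding."""
    padded = list(entries) + [0] * (slots - len(entries))
    return struct.pack(f"<{slots}Q", *padded)

# Made-up contents: the newer block has two entries appended.
old_block = make_block([0x1111, 0x2222, 0x3333])
new_block = make_block([0x1111, 0x2222, 0x3333, 0x4444, 0x5555])

# Appending data only ever increases the first checksum word.
first_word_delta = fletcher4(new_block)[0] - fletcher4(old_block)[0]

# Zero the last two entries (bytes 24..39) and checksum again.
trimmed = new_block[:24] + bytes(16) + new_block[40:]
matches_old = fletcher4(trimmed) == fletcher4(old_block)
```

In this toy case `first_word_delta` equals the sum of the appended 32-bit words, and the trimmed block checksums to the old value, which is the same test that was applied to the real space map block.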
Here is an excerpt from the space map:

action 1 syncpass 1 txg 22990083
offset a02bc49 type 1 run 6
offset a02bc6d type 1 run 6
offset a02f963 type 1 run 6
offset a02f96f type 1 run 24
offset a02f99f type 1 run 6
offset b9b0209 type 1 run 384
offset b9bdb89 type 1 run 384
offset b9bde89 type 1 run 2304
offset ba12309 type 1 run 1152
offset ba2e509 type 1 run 3456
offset ba2f409 type 1 run 2304
action 1 syncpass 2 txg 22995718
offset 43a1099 type 1 run 6
offset 43a10a5 type 1 run 6
offset 43a110b type 1 run 6
offset 43a1141 type 1 run 6
offset 6dbf64b type 1 run 6
offset 6ddcd69 type 1 run 12
action 1 syncpass 2 txg 22999492
offset c7aec8 type 1 run 3
offset 23e29e1 type 1 run 3
action 1 syncpass 1 txg 22999548
offset 105eecb type 1 run 6
offset 105eee3 type 1 run 768
offset 105f1e9 type 1 run 384
offset 105f36f type 1 run 384
offset 105f4f5 type 1 run 768
offset 105f7fb type 1 run 3846
action 1 syncpass 2 txg 22999548
offset 3a67427 type 1 run 3

So the L1-block (the blkptr-array) contains the checksum of an older version of the block. Interestingly enough, zdb doesn't see a checksum error. Dumping the L1-block with zdb gives a block that is identical to the one in memory, except for that single checksum.

Our code base contains the persistent l2arc-patch, pulled from the nexenta repo shortly before it got upstreamed. The l2arc contains the L1-block, but not the L0-block. It looks like the old version of the L1-block somehow made it to disk.

To confirm that the problem is really related to persistent l2arc, we removed and re-added the cache devices. After the next reboot, the problem was gone.

Another clue to the problem might be that the space map was modified twice during that txg: it has entries for both syncpass 1 and syncpass 2.
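A small sketch of how one could spot such txgs in a dump like the excerpt above. The header lines are copied from that excerpt. The regular expression and the helper `syncpasses_by_txg` are assumptions based only on how the dump is printed in this report, not on any zdb output format.

```python
import re
from collections import defaultdict

HEADER = re.compile(r"action (\d+) syncpass (\d+) txg (\d+)")

def syncpasses_by_txg(dump: str):
    """Map each txg to the set of syncpass numbers seen in its header lines."""
    passes = defaultdict(set)
    for line in dump.splitlines():
        m = HEADER.fullmatch(line.strip())
        if m:
            passes[int(m.group(3))].add(int(m.group(2)))
    return dict(passes)

# Header lines (plus two entry lines) taken from the excerpt in this report.
excerpt = """\
action 1 syncpass 1 txg 22990083
action 1 syncpass 2 txg 22995718
action 1 syncpass 2 txg 22999492
action 1 syncpass 1 txg 22999548
offset 105eecb type 1 run 6
action 1 syncpass 2 txg 22999548
offset 3a67427 type 1 run 3
"""

# Keep only txgs that were written in more than one sync pass.
multi = {txg: sorted(p) for txg, p in syncpasses_by_txg(excerpt).items() if len(p) > 1}
print(multi)  # {22999548: [1, 2]}
```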


Related issues

Related to illumos gate - Feature #3525: Persistent L2ARC (Feedback, 2013-02-04)

History

#1

Updated by Arne Jansen over 3 years ago

#2

Updated by Marcel Telka over 3 years ago

  • Category set to zfs - Zettabyte File System

#3

Updated by Arne Jansen over 3 years ago

I think the problem is quite simple: before a buffer gets modified, it is released from l2arc via arc_release. This makes sure the outdated data in l2arc never gets read again. But this only releases the in-memory reference to the data, not the persistent data in l2arc itself. After a reboot, the l2arc gets scanned and a reference to that stale data gets reconstructed, so a subsequent read will be served from l2arc.
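A toy model of the sequence described above, purely for illustration. The class `ToyL2Arc` and its methods are hypothetical and do not correspond to the real ARC code. The only assumptions modelled are that a release drops the in-memory reference and that a reboot rebuilds the index from what is on the cache device.

```python
class ToyL2Arc:
    """Toy model: an in-memory index plus a persistent on-device log."""

    def __init__(self):
        self.index = {}        # in-memory: block id -> cached payload
        self.device_log = []   # persistent: survives a simulated reboot

    def write(self, block_id, payload):
        self.index[block_id] = payload
        self.device_log.append((block_id, payload))

    def release(self, block_id):
        # Only the in-memory reference is dropped; the on-device
        # record is left untouched.
        self.index.pop(block_id, None)

    def reboot(self):
        # Rebuild the in-memory index by scanning the persistent log.
        self.index = dict(self.device_log)

    def read(self, block_id):
        return self.index.get(block_id)

cache = ToyL2Arc()
cache.write("L1", "checksum-of-old-L0")
cache.release("L1")                    # block is about to be modified in the pool
before_reboot = cache.read("L1")       # None: the stale copy is unreachable
cache.reboot()
after_reboot = cache.read("L1")        # the stale copy is reachable again
```

In this model the release is effective only until the index is rebuilt, which matches the observation that the problem showed up after a reboot.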

I'm not sure how checksum handling is done, but afair the l2arc has its own checksums. The checksum from the bp never gets checked, so the mismatch goes unnoticed.

As a solution it might be possible to also write release information to l2arc, so these buffers already get released again during scanning. But this has to be done in sync with the txgs so there are no race conditions.
Another way might be to also check the bp-checksum, but I guess there are good reasons not to do so currently.

#4

Updated by Sašo Kiselkov over 3 years ago

Arne Jansen wrote:

I'm not sure how checksum handling is done, but afair the l2arc has its own checksums. The checksum from the bp never gets checked, so the mismatch goes unnoticed.

Before writing to the L2ARC, we compute the checksum and store it in the arc_buf_hdr_t (b_freeze_cksum). On read back, it's checked against this. Obviously, we store the wrong checksum here, since we wrote the old data out (including the persistence header with the old b_freeze_cksum value).
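To illustrate why the mismatch goes unnoticed, here is a minimal sketch. SHA-256 is used as a stand-in checksum and is not the algorithm involved here. The payloads and the names `cksum`, `l2_read_ok`, `persisted_cksum` and `bp_cksum` are hypothetical.

```python
import hashlib

def cksum(data: bytes) -> str:
    # Stand-in checksum (SHA-256), chosen only to keep the sketch short.
    return hashlib.sha256(data).hexdigest()

def l2_read_ok(payload: bytes, stored_cksum: str) -> bool:
    # The cached data is verified against the checksum persisted with it.
    return cksum(payload) == stored_cksum

old_l1 = b"L1 block, version written to the cache device"
new_l1 = b"L1 block, version currently in the pool"

persisted_cksum = cksum(old_l1)   # stored alongside the old data
bp_cksum = cksum(new_l1)          # what the parent block pointer expects

# The stale data is self-consistent, so the cache-side check passes.
passes_l2_check = l2_read_ok(old_l1, persisted_cksum)
# A comparison against the block pointer checksum would have failed.
passes_bp_check = cksum(old_l1) == bp_cksum
```

Because the old data and the old checksum were persisted together, the first check succeeds and only a comparison against the block pointer would expose the stale copy.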

As a solution it might be possible to also write release information to l2arc, so these buffers already get released again during scanning. But this has to be done in sync with the txgs so there are no race conditions.
Another way might be to also check the bp-checksum, but I guess there are good reasons not to do so currently.

I think the simplest solution is to just put a check in l2arc_write_eligible that the b_birth of a block is less than the spa_syncing_txg, so that there is no chance of further modification occurring after the block has been L2 cached but before the txg rolls over.
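The proposed condition can be sketched as follows. This is only a model of the comparison, not the actual l2arc_write_eligible code. The `Hdr` class and the function `birth_check` are hypothetical, and the txg values are taken from the space map excerpt for illustration.

```python
from dataclasses import dataclass

@dataclass
class Hdr:
    b_birth: int  # txg in which the block was born

def birth_check(hdr: Hdr, spa_syncing_txg: int) -> bool:
    """Allow caching only for blocks born before the currently syncing txg."""
    return hdr.b_birth < spa_syncing_txg

syncing = 22999548
born_earlier = birth_check(Hdr(b_birth=22999492), syncing)   # eligible
born_in_syncing = birth_check(Hdr(b_birth=22999548), syncing)  # not eligible
```

A block born in the syncing txg, such as a space map block rewritten in a later sync pass, would be skipped until the txg has rolled over.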