Bug #11619

When sequential resilvering is enabled, it can cause checksum errors under certain conditions. This problem is easy to reproduce on SSD-based systems.

Added by Sanjay Nadkarni 21 days ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags:

Description

Here's a script that can reproduce this on NSStor systems. Note that while the Linux sequential resilvering code is based on NSStor, there are quite a few other changes in that code. I am filing this bug here to see whether Illumos sees these issues now that ZoL and Illumos are in sync.

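#!/bin/ksh -exu
# Repeatedly detach and re-attach one side of the mirror until a
# resilver leaves a non-zero error counter behind.
while true; do
    # Detach the second disk if it is currently part of the pool.
    if zpool status tank | grep -q c0t5000A72030069982d0; then
        zpool detach tank c0t5000A72030069982d0
    fi

    # Re-attach it, which triggers a resilver.
    zpool attach tank c0t5000A7203006998Bd0 c0t5000A72030069982d0

    # Wait for the resilver to finish.
    while zpool status tank | grep -q resilvering; do
        sleep $(( RANDOM % 30 ))
    done
    zpool status tank
    # Stop if any disk line does not end in '0', which indicates a
    # non-zero CKSUM count.
    if zpool status tank | grep c0t5 | grep -v '0$'; then
        zpool status tank
        exit
    fi
done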

zpool status tank
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub in progress since Tue Sep 18 12:15:36 2018
        2.23G scanned, 2.23G verified out of 3.18G at 380M/s, 69.98% done
        0 repaired, 0h0m to go
  trim: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        tank                       ONLINE       0     0     0
          mirror-0                 ONLINE       0     0     0
            c0t5000A7203006998Bd0  ONLINE       0     0     0
            c0t5000A72030069982d0  ONLINE       0     0   372

errors: No known data errors
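
The script's final check relies on the disk line not ending in '0', which would miss a count such as 370. As a rough sketch only (the function name cksum_errors is hypothetical and not part of the original script), the CKSUM column could be read explicitly with awk. It assumes the five-column NAME/STATE/READ/WRITE/CKSUM layout shown above and reads the status text on stdin:

```shell
# Hypothetical helper: print "<vdev> <cksum>" for every row of the
# `zpool status` config table whose CKSUM column is not 0.
# Reads the status text on stdin, so it can be tried without a pool.
cksum_errors() {
    awk '
        # The header row marks the start of the config table.
        $1 == "NAME" && $NF == "CKSUM" { in_table = 1; next }
        # The "errors:" summary line marks the end of the table.
        $1 == "errors:" { in_table = 0 }
        # Table rows carry CKSUM in the fifth field; trailing notes
        # such as "(resilvering)" may follow it.
        in_table && NF >= 5 && $5 != "0" { print $1, $5 }
    '
}
```

With the output above, `zpool status tank | cksum_errors` would be expected to print the second disk and its count; an empty result means no row reported checksum errors.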

This has been fixed in our tree.
Comments from engineer:
The core problem seems to have been caused by a mistaken assumption in my scan handling code where I assumed, since it was occurring in transaction group commit, that it couldn't be called in a multi-threaded fashion. This appears to have been in error. I'm not sure if multi-threaded behavior was only introduced by an upstream feature merge (multi-threaded spa sync), or if this issue has always been there.
commit 6a825cbea9b40efcea69c8166663135d29b1abf6
Author: Saso Kiselkov <>
Date:   Mon Oct 8 16:14:19 2018 -0600

    NEX-18589 checksum errors on SSD-based pool
    Reviewed by: Roman Strashkin <>
    Reviewed by: Sanjay Nadkarni <>
