Bug #11619

When sequential resilvering is enabled, it can cause checksum errors under certain conditions. This problem is easy to reproduce on SSD-based systems.

Added by Sanjay Nadkarni 3 months ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty: Medium
Tags:

Description

Here's a script that can reproduce this on NSStor systems. Note that while the Linux sequential resilvering code is based on NSStor, there are quite a few other changes in that code. I am filing this bug here to see whether illumos sees these issues now that ZoL and illumos are in sync.

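#!/bin/ksh -exu
# Repeatedly detach and re-attach one side of the mirror, then check for errors.
while true; do
    # Detach the target disk if it is currently part of the pool.
    if zpool status tank | grep -q c0t5000A72030069982d0; then
        zpool detach tank c0t5000A72030069982d0
    fi

    # Re-attach it to the other mirror side, which starts a resilver.
    zpool attach tank c0t5000A7203006998Bd0 c0t5000A72030069982d0

    # Wait for the resilver to finish, polling at random intervals.
    while zpool status tank | grep -q resilvering; do
        sleep $(( RANDOM % 30 ))
    done
    zpool status tank
    # Stop as soon as any c0t5 device line ends in something other than 0.
    if zpool status tank | grep c0t5 | grep -v '0$'; then
        zpool status tank
        exit
    fi
done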

zpool status tank
pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://illumos.org/msg/ZFS-8000-9P
scan: scrub in progress since Tue Sep 18 12:15:36 2018
2.23G scanned, 2.23G verified out of 3.18G at 380M/s, 69.98% done
0 repaired, 0h0m to go
trim: none requested
config:

NAME                       STATE     READ WRITE CKSUM
tank                       ONLINE       0     0     0
  mirror-0                 ONLINE       0     0     0
    c0t5000A7203006998Bd0  ONLINE       0     0     0
    c0t5000A72030069982d0  ONLINE       0     0   372

errors: No known data errors
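
A note on the error check in the script: `grep -v '0$'` only looks at the last character of each device line, so a CKSUM count that happens to end in 0 (10, 370, and so on) would not stop the loop. The sketch below is one possible stricter check, not part of the original report. The function name `cksum_errors` is hypothetical, and it assumes the usual `zpool status` config layout, with a `NAME STATE READ WRITE CKSUM` header row and a blank line closing the table.

```shell
#!/bin/sh
# Hypothetical helper (not from the original report): reads `zpool status`
# text on stdin and prints "<name> <cksum>" for every config row whose
# CKSUM column is anything other than the string "0".
cksum_errors() {
    awk '
        # The header row marks the start of the config table.
        $1 == "NAME" && $5 == "CKSUM"     { in_config = 1; next }
        # A blank line after the header marks the end of the table.
        in_config && NF == 0              { in_config = 0 }
        # Inside the table, the fifth field is the CKSUM counter.
        in_config && NF >= 5 && $5 != "0" { print $1, $5 }
    '
}
```

In the repro loop this could be used as `zpool status tank | cksum_errors`, exiting when the output is non-empty. For the status output above it would print `c0t5000A72030069982d0 372`.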

This has been fixed in our tree.
Comments from engineer:
The core problem seems to have been caused by a mistaken assumption in my scan handling code where I assumed, since it was occurring in transaction group commit, that it couldn't be called in a multi-threaded fashion. This appears to have been in error. I'm not sure if multi-threaded behavior was only introduced by an upstream feature merge (multi-threaded spa sync), or if this issue has always been there.
commit 6a825cbea9b40efcea69c8166663135d29b1abf6
Author: Saso Kiselkov <>
Date: Mon Oct 8 16:14:19 2018 -0600

NEX-18589 checksum errors on SSD-based pool
Reviewed by: Roman Strashkin <>
Reviewed by: Sanjay Nadkarni <>
