Project

General

Profile

Bug #10844

Serialize ZTHR operations to eliminate races

Added by Jerry Jelinek 5 months ago. Updated 5 months ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
zfs - Zettabyte File System
Start date:
2019-04-22
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage

Description

ZoL 61c3391acc9 Serialize ZTHR operations to eliminate races

History

#1

Updated by Jerry Jelinek 5 months ago

From the ZoL commit msg:
Adds a new lock for serializing operations on zthrs.
The commit also includes some code cleanup and
refactoring.

#2

Updated by Dan McDonald 5 months ago

From ZoL commiter Serapheim Dimitropoulos:

This issue was filed in ZoL (https://github.com/zfsonlinux/zfs/issues/7744) but we've also hit it
internally within DelphixOS (illumos-variant) and thus affects illumos too.

The bug was discovered by an assertion failure during a ztest run. The specific assertion is the
`ASSERT;` statement located in `zthr_resume()` of zthr.c and can fail when
multiple threads are trying to destroy/export a pool and at least one of them encounters an
error.

Timeline includes (4 threads: 3 [thread A, B, C] trying to destroy the pool and 1 [thread D] the actual zthr):
1] Thread A starts and sends a cancel request to Thread D waiting for it to be cancelled
2] Thread B starts and sends a cancel request to Thread D waiting for it to be cancelled
3] Thread D gets cancelled and both thread A and B wake up.
4] Thread A fails its destroy because Thread B has a reference to the pool, so it decided to restart thread D as part of its cleanup.
5] Thread C starts and sends a cancel request to Thread D waiting for it to be cancelled
6] Thread B fails its destroy because Thread C has a reference to the pool, so it decided to restart thread D which is being cancelled and thus hits the assertion.

After encountering this bug and spending close attention on how that piece of infrastructure
and its locking was organized together with the wider pool locking we found a few more
variants of the above race. The ZoL fix should get rid of all of the races of this class.

As for the impact of the actual bug; it is minimal. The case has only been reproduced within
ztest which actually avoids a lot of locks on its path (compared to issuing commands from
the userland utilities). In addition, if this bug were to come up in a non-debug build (no ASSERTS)
then the result would be one leaked thread that is sleeping. If you want to test whether your
system has been affected, do the math and ensure that you don't have more than the expected
threads.

Hypothetical example:
If your version of ZFS supports condensing after removal and the pool checkpoint, and the split
in the arc reclaim jobs you should have 4 zthr threads per pool. If you see on the stack of mdb/kmdb
that there exists 5 zthrs then you must have hit this bug.

#3

Updated by Electric Monk 5 months ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 6a316e1f6d32750bb8fcf2558dcb17b90ca580fd

commit  6a316e1f6d32750bb8fcf2558dcb17b90ca580fd
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
Date:   2019-04-29T18:49:59.000Z

    10844 Serialize ZTHR operations to eliminate races
    Portions contributed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Reviewed by: Matt Ahrens <mahrens@delphix.com>
    Reviewed by: Tom Caputi <tcaputi@datto.com>
    Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
    Reviewed by: Kody Kantor <kody.kantor@joyent.com>
    Approved by: Dan McDonald <danmcd@joyent.com>

Also available in: Atom PDF