Bug #13092

ZFS I/O pipeline should use the pageout_reserve pool

Added by Joshua M. Clulow almost 3 years ago. Updated over 2 years ago.

Status: Closed
Priority: Normal
Category: zfs - Zettabyte File System
% Done: 100%
Difficulty: Hard

Description

When the system is under memory pressure (freemem is below lotsfree), pageout_scanner() begins scanning memory to look for pages we can push out to disk-based swap. The I/O requests that send dirty pages to disk, so that they may then be freed, are performed by pageout(). As the system is often nearly out of memory at this point, it is critical that this mechanism be able to perform I/O without blocking to allocate. The mechanism we have to try to ensure that this is possible is the pageout reserve pool (pageout_reserve).

Any routine that pageout() calls is able to access pages in the reserve pool, even when other threads would be throttled (see NOMEMWAIT(), KM_PUSHPAGE, page_create_throttle(), etc). When swapping to ZFS, this includes any memory (e.g., ARC buffers, range tree additions) allocated in the eventual zvol_strategy() call. Unfortunately it does not extend to the rest of the ZFS I/O pipeline (see zio_execute()) which will still be blocked when it inevitably requires additional pages that are not available. If the pipeline becomes blocked, pool I/O will generally stall; if this stall does not resolve due to I/O already in flight, the system will enter the low memory deadlock condition from which there is no return.
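To make the throttling distinction concrete, here is a deliberately simplified userland model in C. It is not the illumos implementation: sim_vm_t, sim_page_alloc() and the single "floor" threshold are hypothetical names invented for illustration, and the real page_create_throttle() logic has more cases. The model only shows the idea that an ordinary caller must leave a floor of free pages untouched, while a privileged caller (one that would satisfy NOMEMWAIT()) may dip below it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of free memory; not a kernel structure. */
typedef struct sim_vm {
	size_t sv_free;		/* pages currently free */
	size_t sv_floor;	/* pages ordinary callers must leave untouched */
} sim_vm_t;

/*
 * Returns true and takes the pages if the caller may proceed without
 * blocking.  Returns false where a real caller would be throttled.
 */
bool
sim_page_alloc(sim_vm_t *vm, size_t npages, bool privileged)
{
	assert(vm != NULL);

	/* Nobody can take pages that do not exist. */
	if (npages > vm->sv_free)
		return (false);

	/* Ordinary callers may not push free memory below the floor. */
	if (!privileged && vm->sv_free - npages < vm->sv_floor)
		return (false);

	vm->sv_free -= npages;
	return (true);
}
```

In this model, a ZFS pipeline thread that is not privileged is the caller that gets false once free memory reaches the floor, even though the pageout() thread that is waiting on it would have been allowed through.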

This condition appears to have been acknowledged, and piecemeal fixes have been made throughout ZFS; one need only look for use of the KM_PUSHPAGE flag. As we increasingly take patches from other systems, where swap to ZFS is at least less popular, if not simply unavailable, it would seem that we need a structural change that increases the likelihood that ZFS allocations will dip into the reserve.

I would like to add a new thread flag, T_PUSHPAGE, which would grant the same status as being part of the proc_pageout process today. zio_execute() would set this flag before entering the pipeline, and remove it once complete. If we discover more asynchronous contexts in ZFS that can block pageout() in future, we can ensure those threads get the same treatment as well. In order to reduce risk, I am not currently proposing that we use this thread flag in place of the proc_pageout (etc) checks in NOMEMWAIT(), nor am I proposing unwinding the explicit use of KM_PUSHPAGE in ZFS today.
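The set-then-clear pattern proposed here can be sketched in userland C as follows. This is an illustration under stated assumptions, not the committed change: sim_thread_t, sim_run_pipeline() and sim_record_flag() are hypothetical stand-ins for the kernel thread structure and zio_execute(), and the T_PUSHPAGE value is the one used in the D script in the testing notes on this issue. Restoring the previous flag state, rather than clearing it unconditionally, is a choice made in this sketch so that a nested call does not strip the flag from an outer caller; whether the real change does this is not stated here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define	T_PUSHPAGE	0x20000	/* value as used in the D script on this issue */

/* Hypothetical stand-in for the kernel thread structure. */
typedef struct sim_thread {
	unsigned int st_flag;
} sim_thread_t;

typedef void (*sim_stage_fn)(sim_thread_t *, void *);

/*
 * Run one pass of a (simulated) pipeline with T_PUSHPAGE set on the
 * calling thread, then put the flag back the way it was found.
 */
void
sim_run_pipeline(sim_thread_t *t, sim_stage_fn fn, void *arg)
{
	assert(t != NULL && fn != NULL);

	unsigned int had = t->st_flag & T_PUSHPAGE;

	t->st_flag |= T_PUSHPAGE;
	fn(t, arg);

	/* Only clear the flag if this call was the one that set it. */
	if (had == 0)
		t->st_flag &= ~(unsigned int)T_PUSHPAGE;
}

/* Example stage: records whether the flag is visible inside the pipeline. */
void
sim_record_flag(sim_thread_t *t, void *arg)
{
	*(bool *)arg = (t->st_flag & T_PUSHPAGE) != 0;
}
```

An allocation check made from inside the stage function would see the flag and could treat the thread like pageout(); once sim_run_pipeline() returns, the thread is back to whatever status it had before.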


Files

out.svg (64.8 KB): flamegraph of T_PUSHPAGE allocation. Joshua M. Clulow, 2021-02-02 08:27 AM
out.stacks.txt (21.8 KB): stacks with T_PUSHPAGE allocation. Joshua M. Clulow, 2021-02-02 08:27 AM
ss.png (121 KB). Joshua M. Clulow, 2021-02-02 08:30 AM

Related issues

Related to illumos gate - Bug #13093: ZFS should be more careful in low memory situations (New, Joshua M. Clulow)

Actions #1

Updated by Joshua M. Clulow almost 3 years ago

  • Category changed from kernel to zfs - Zettabyte File System
Actions #2

Updated by Joshua M. Clulow almost 3 years ago

  • Related to Bug #13093: ZFS should be more careful in low memory situations added
Actions #3

Updated by Electric Monk almost 3 years ago

  • Gerrit CR set to 891
Actions #4

Updated by Joshua M. Clulow over 2 years ago

Testing Notes

I have been trying to run the ZFS test suite, but it is frankly not in good shape. There are a few errors that seem legitimate, and a raft of errors that don't seem related; e.g., UFS having issues writing its log, incorrect setup/teardown of dump devices, etc. I haven't seen anything that suggests to me that I've made it worse with this change.

I explicitly checked for use of the T_PUSHPAGE flag at the point of allocation of pages with this D script, while exhausting memory on the system to induce pageout() push activity:

#!/usr/sbin/dtrace -qCs

#pragma D option stackframes=100

#define T_PUSHPAGE 0x20000

fbt::page_create_throttle:entry
/(curthread->t_flag & T_PUSHPAGE) != 0/
{
        @[stack()] = count();
}

The collected stacks and a flamegraph are attached. As expected, this is all within the ZIO pipeline, except for a few calls to pageout() which end up traversing through zio_execute().

Actions #5

Updated by Joshua M. Clulow over 2 years ago

Actions #6

Updated by Electric Monk over 2 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit f6ef42236c2f60c5f16c454c9b574a4dc35e1cab

commit  f6ef42236c2f60c5f16c454c9b574a4dc35e1cab
Author: Joshua M. Clulow <josh@sysmgr.org>
Date:   2021-02-03T00:42:59.000Z

    13092 ZFS I/O pipeline should use the pageout_reserve pool
    Reviewed by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
    Reviewed by: Jason King <jason.brian.king@gmail.com>
    Approved by: Dan McDonald <danmcd@joyent.com>
