Feature #13082

pageout needs a deadman

Added by Joshua M. Clulow 11 months ago. Updated 10 months ago.

Status: Closed
Priority: Normal
Category: kernel
% Done: 100%
Difficulty: Hard
Gerrit CR: 883
Description

There are some lingering issues in low memory situations that cause the system to deadlock while paging out to ZFS pools. These deadlocks usually present as pageout waiting for a ZIO to complete while the ZIO executor thread, or some other piece of the ZFS machinery, makes an allocation without KM_PUSHPAGE or from a thread that does not implicitly get to use the reserve page pool through NOMEMWAIT().
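
To make the flag distinction concrete, here is a minimal kernel-side sketch; zio_path_alloc() is a hypothetical name for illustration, not a function in the gate:

/*
 * Hypothetical illustration of the allocation-flag distinction described
 * above; zio_path_alloc() is not a function in the gate.
 */
#include <sys/kmem.h>

static void *
zio_path_alloc(size_t size)
{
        /*
         * kmem_alloc(size, KM_SLEEP) can block in page_create_throttle()
         * once freemem is exhausted -- fatal if pageout is itself waiting
         * on the ZIO this thread is executing.
         *
         * KM_PUSHPAGE marks the allocation as being in service of pageout,
         * permitting use of the reserve page pool instead of blocking.
         */
        return (kmem_alloc(size, KM_PUSHPAGE));
}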

There are two other deadman mechanisms that did not appear to trip:

  • snooping -- I suspect that although we cannot see it, clock() is still running so this does not trip
  • spa_deadman (the ZFS I/O deadman) -- I suspect in this case we are stuck in parts of ZFS that prevent acquisition of the locks that vdev_deadman() needs to take to check for deadlocks

I think what we need is a specific deadman for the pageout() mechanism that will trip when we spend more than some number of seconds in a state like this:

fffffe000274bc20 SLEEP    CV                      1
                 swtch+0x133
                 cv_wait+0x68
                 zio_wait+0x6b
                 dmu_tx_count_write+0x1a9
                 dmu_tx_hold_write+0x57
                 zvol_strategy+0x211
                 bdev_strategy+0x64
                 spec_startio+0x8e
                 spec_pageio+0x3a
                 fop_pageio+0x5e
                 swap_putapage+0x1f4
                 swap_putpage+0x2d1
                 fop_putpage+0x56
                 pageout+0x276
                 thread_start+0xb

(We are waiting here on a ZIO that will never complete, because the taskq thread executing it is itself blocked in page_create_throttle(), waiting for pageout to free memory:)

fffffe000245dc20 SLEEP    CV                      1
                 swtch+0x133
                 cv_wait+0x68
                 page_create_throttle+0x168
                 page_create_va+0x598
                 segkmem_page_create+0x97
                 segkmem_xalloc+0x13f
                 segkmem_alloc_vn+0x3b
                 segkmem_alloc+0x17
                 vmem_xalloc+0x629
                 vmem_alloc+0x190
                 kmem_slab_create+0x7c
                 kmem_slab_alloc+0x10b
                 kmem_cache_alloc+0x15b
                 zfs_btree_add_idx+0x1b2
                 range_tree_add_impl+0x74f
                 range_tree_add+0x13
                 metaslab_free_concrete+0xe4
                 metaslab_free_impl+0xd1
                 metaslab_free_dva+0x77
                 metaslab_free+0x1ae
                 zio_dva_free+0x25
                 zio_execute+0x7c
                 zio_wait+0x40
                 zio_free+0xdc
                 dsl_free+0x13
                 dsl_dataset_block_kill+0x27e
                 dbuf_write_done+0x126
                 arc_write_done+0x99
                 zio_done+0x402
                 zio_execute+0x7c
                 taskq_thread+0x2cd
                 thread_start+0xb

While we might not actually have enough memory in reserve to even begin dumping, it would be nice to at least try; failing that, we should at least allow an automatic reboot and return to service, or let the curious engineer wind up at a KMDB prompt.
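
Concretely, such a check might look like the sketch below, run once a second from clock(). The tracking variable names (pageout_pushcount, pageout_pushcount_seen, pageout_pushing, pageout_deadman_seconds) are the ones observed later in this ticket; the types, the pageout_stucktime counter, and the exact structure are assumptions of the sketch, not necessarily the committed code:

/*
 * Sketch of a pageout deadman check, run once a second from clock().
 * Variable names follow those observed elsewhere in this ticket; the
 * types and the pageout_stucktime counter are assumed for illustration.
 */
#include <sys/types.h>
#include <sys/cmn_err.h>

extern volatile uint64_t pageout_pushcount;     /* pushes begun by pageout */
extern uint64_t pageout_pushcount_seen;         /* value at our last check */
extern volatile int pageout_pushing;            /* pageout is mid-push */
extern int pageout_deadman_seconds;             /* 0 disables the deadman */
extern pgcnt_t freemem;

void
pageout_deadman(void)
{
        static uint_t pageout_stucktime = 0;

        if (pageout_deadman_seconds == 0) {
                /* Deadman disabled, e.g. via /etc/system. */
                return;
        }

        if (!pageout_pushing || pageout_pushcount != pageout_pushcount_seen) {
                /*
                 * Either pageout is not mid-push, or it has begun a new
                 * push since our last check: it is making progress.
                 */
                pageout_stucktime = 0;
                pageout_pushcount_seen = pageout_pushcount;
                return;
        }

        /*
         * Still stuck on the same push; panic once the limit is reached,
         * so we can attempt a dump or land the engineer in kmdb.
         */
        if (++pageout_stucktime >= pageout_deadman_seconds) {
                panic("pageout_deadman: stuck pushing the same page for "
                    "%d seconds (freemem is %lu)",
                    pageout_deadman_seconds, (ulong_t)freemem);
        }
}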

Actions #1

Updated by Joshua M. Clulow 11 months ago

  • Assignee set to Joshua M. Clulow
Actions #2

Updated by Joshua M. Clulow 11 months ago

I have a relatively simple prototype for this that works well enough that I have been able to get a few memory deadlock dumps from AWS:

https://github.com/jclulow/illumos-gate/commit/2ecfa1fe238d8e4fdab04ab0998164dbc5583fb8

One of the dumps was indeed quite low on memory:

> freemem/D
freemem:
freemem:        0
Actions #3

Updated by Electric Monk 11 months ago

  • Gerrit CR set to 883
Actions #4

Updated by Joshua M. Clulow 11 months ago

Testing Notes

I have a 1GB OpenIndiana VM. Things work about as expected unless there is severe memory pressure from, say, a user program. Under protracted pressure, we hit a deadlock within a minute or so of pageout activity:

NMI received: entering kmdb

kmdb: target stopped at:
kmdb_enter+0xb: movq   %rax,%rdi
[0]> pushes/D
pushes:
pushes:         3

[0]> ::stacks -c pageout | ::findstack -v
stack pointer for thread fffffe0002b3dc20 (pageout/1): fffffe0002b3c820
[ fffffe0002b3c820 _resume_from_idle+0x12b() ]
  fffffe0002b3c850 swtch+0x133()
  fffffe0002b3c890 cv_wait+0x68(fffffffffbc415fc, fffffffffbc415f0)
  fffffe0002b3c8f0 page_create_wait+0x113(20, 43)
  fffffe0002b3c9d0 page_create_va+0x569(fffffffffbd087c0, fffffe0202063000, 20000, 53, fffffe0002b3c9e0, fffffe0202063000)
  fffffe0002b3ca60 segkmem_page_create+0x97(fffffe0202063000, 20000, 4, fffffffffbd087c0)
  fffffe0002b3cb00 segkmem_xalloc+0x13f(fffffe01e4c1f000, 0, 20000, 4, 0, fffffffffb8a7ed0, fffffffffbd087c0)
  fffffe0002b3cb70 segkmem_alloc_vn+0x3b(fffffe01e4c1f000, 20000, 4, fffffffffbd087c0)
  fffffe0002b3cba0 segkmem_alloc+0x17(fffffe01e4c1f000, 20000, 4)
  fffffe0002b3ccb0 vmem_xalloc+0x629(fffffe01e4c20000, 20000, 1000, 0, 0, 0, 0, 4)
  fffffe0002b3cd20 vmem_alloc+0x190(fffffe01e4c20000, 20000, 4)
  fffffe0002b3cdb0 kmem_slab_create+0x7c(fffffe01e5fc3008, 4)
  fffffe0002b3ce10 kmem_slab_alloc+0x10b(fffffe01e5fc3008, 4)
  fffffe0002b3ce70 kmem_cache_alloc+0x15b(fffffe01e5fc3008, 4)
  fffffe0002b3cea0 zfs`zio_buf_alloc+0x59(20000)
  fffffe0002b3cee0 zfs`abd_borrow_buf+0x54(fffffe01ecbf0bc0, 20000)
  fffffe0002b3cf50 zfs`vdev_disk_io_start+0x21d(fffffe02006bc958)
  fffffe0002b3cfa0 zfs`zio_vdev_io_start+0x89(fffffe02006bc958)
  fffffe0002b3cfd0 zfs`zio_execute+0x7c(fffffe02006bc958)
  fffffe0002b3d000 zfs`zio_nowait+0x2b(fffffe02006bc958)
  fffffe0002b3d090 zfs`vdev_mirror_io_start+0xba(fffffe0200740228)
  fffffe0002b3d0e0 zfs`zio_vdev_io_start+0x1a2(fffffe0200740228)
  fffffe0002b3d110 zfs`zio_execute+0x7c(fffffe0200740228)
  fffffe0002b3d140 zfs`zio_nowait+0x2b(fffffe0200740228)
  fffffe0002b3d260 zfs`arc_read+0x849(fffffe01e7f352e8, fffffe01e5f7e000, fffffe0002b3d348, 0, 0, 2, ffffffff00000180, fffffe0002b3d294, fffffe01f4059418)
  fffffe0002b3d2b0 zfs`dbuf_issue_final_prefetch+0x72(fffffe01f4059410, fffffe0002b3d348)
  fffffe0002b3d430 zfs`dbuf_prefetch+0x47f(fffffe01ecfdcab0, 1, ff, 2, 20)
  fffffe0002b3d4b0 zfs`dmu_zfetch+0x248(fffffe01ecfdcd28, 3bbf7, e, 0)
  fffffe0002b3d560 zfs`dmu_buf_hold_array_by_dnode+0x317(fffffe01ecfdcab0, 3bbf7000, e000, 0, fffffffff7a69558, fffffe0002b3d62c, fffffe0002b3d630, fffffe0200000000)
  fffffe0002b3d600 zfs`dmu_buf_hold_array+0x92(fffffe01e8f12d00, 1, 3bbf7000, e000, 0, fffffffff7a69558, fffffe0002b3d62c, fffffe0002b3d630)
  fffffe0002b3d690 zfs`dmu_write+0x98(fffffe01e8f12d00, 1, 3bbf7000, e000, fffffe021563f000, fffffe01e8b5d280)
  fffffe0002b3d720 zfs`zvol_strategy+0x31b(fffffe01f9cef600)
  fffffe0002b3d750 bdev_strategy+0x64(fffffe01f9cef600)
  fffffe0002b3d7b0 specfs`spec_startio+0x8e(fffffe01eb360380, fffffe0000305918, 3bbf7000, e000, 8500)
  fffffe0002b3d810 specfs`spec_pageio+0x3a(fffffe01eb360380, fffffe0000305918, 3bbf7000, e000, 8500, fffffe01e5c84db0, 0)
  fffffe0002b3d8a0 fop_pageio+0x5e(fffffe01eb360380, fffffe0000305918, 3bbf7000, e000, 8500, fffffe01e5c84db0, 0)
  fffffe0002b3d980 swap_putapage+0x1f4(fffffe01ea628980, fffffe0000305918, fffffe0002b3d9b8, fffffe0002b3d9c0, 8400, fffffe01e5c84db0)
  fffffe0002b3da30 swap_putpage+0x2d1(fffffe01ea628980, 1fffffc03ed86000, 1000, 8400, fffffe01e5c84db0, 0)
  fffffe0002b3dab0 fop_putpage+0x56(fffffe01ea628980, 1fffffc03ed86000, 1000, 8400, fffffe01e5c84db0, 0)
  fffffe0002b3db10 pageout+0x271()
  fffffe0002b3db20 thread_start+0xb()

[0]> freemem/D
freemem:
freemem:        8

To confirm that there was no forward progress, I resumed (:c) and sent another NMI a few times, checking that the stacks had not moved. The machine still responds to ICMP pings, and will still echo newlines (but produce no login prompt) on the console while in this state.
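
The memory pressure in these tests came from a user-level load generator. That program is not part of this ticket; the sketch below is merely the kind of tool involved (hypothetical, not the actual generator used): allocate anonymous memory, dirty every page, and never free, until the system is forced into sustained pageout.

/*
 * Hypothetical memory load generator: repeatedly allocate and dirty
 * anonymous memory to force the system into sustained pageout activity.
 * Not the actual program used in these tests.
 */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
        size_t chunk = 64 * 1024 * 1024;        /* 64 MiB per allocation */

        for (;;) {
                char *p = malloc(chunk);
                if (p == NULL) {
                        /* Back off briefly, then keep pressing. */
                        sleep(1);
                        continue;
                }
                /* Touch every page so it must be backed (and paged out). */
                memset(p, 0xa5, chunk);
                /* Leak intentionally to keep the pressure mounting. */
        }

        return (0);
}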

Rebooting into the bits built for this change and running the memory load generator program again, we see:

panic[cpu0]/thread=fffffe00020cbc20: pageout_deadman: stuck pushing the same page for 90 seconds (freemem is 236)

fffffe00020cba60 fffffffffbaf2767 ()
fffffe00020cbad0 genunix:clock+743 ()
fffffe00020cbb60 genunix:cyclic_softint+e2 ()
fffffe00020cbb80 unix:cbe_softclock+19 ()
fffffe00020cbbd0 unix:av_dispatch_softvect+72 ()
fffffe00020cbc00 unix:dispatch_softint+39 ()
fffffe0002005a20 unix:switch_sp_and_call+15 ()
fffffe0002005a60 unix:dosoftint+40 ()
fffffe0002005ab0 unix:do_interrupt+bc ()
fffffe0002005ac0 unix:cmnint+c3 ()
fffffe0002005bb0 unix:mach_cpu_idle+b ()
fffffe0002005be0 unix:cpu_idle+10f ()
fffffe0002005c00 unix:idle+ae ()
fffffe0002005c10 unix:thread_start+b ()

panic: entering debugger (continue to save dump)
Loaded modules: [ random ]
kmdb: target stopped at:
kmdb_enter+0xb: movq   %rax,%rdi

[0]> pageout_pushcount/D
pageout_pushcount:
pageout_pushcount:              9183

[0]> pageout_pushing/D
pageout_pushing:
pageout_pushing:1

[0]> ::stacks -c pageout
THREAD           STATE    SOBJ                COUNT
fffffe0002b3dc20 SLEEP    MUTEX                   1
                 swtch+0x133
                 turnstile_block+0x25b
                 mutex_vector_enter+0x358
                 zfs`vdev_queue_io+0x67
                 zfs`zio_vdev_io_start+0x1ea
                 zfs`zio_execute+0x7c
                 zfs`zio_nowait+0x2b
                 zfs`vdev_mirror_io_start+0xba
                 zfs`zio_vdev_io_start+0x1a2
                 zfs`zio_execute+0x7c
                 zfs`zio_nowait+0x2b
                 zfs`arc_read+0x849
                 zfs`dbuf_read_impl+0x3e6
                 zfs`dbuf_read+0xc9
                 zfs`dbuf_findbp+0x1a3
                 zfs`dbuf_hold_impl+0x1ab
                 zfs`dbuf_hold_level+0x32
                 zfs`dbuf_hold+0x18
                 zfs`dmu_buf_hold_array_by_dnode+0xe6
                 zfs`dmu_buf_hold_array+0x92
                 zfs`dmu_write+0x98
                 zfs`zvol_strategy+0x31b
                 bdev_strategy+0x64
                 specfs`spec_startio+0x8e
                 specfs`spec_pageio+0x3a
                 fop_pageio+0x5e
                 swap_putapage+0x1f4
                 swap_putpage+0x2d1
                 fop_putpage+0x56
                 pageout+0x278
                 thread_start+0xb

[0]> ::stacks -c zio_execute
THREAD           STATE    SOBJ                COUNT
fffffe0002a6ac20 SLEEP    CV                      2
                 swtch+0x133
                 cv_wait+0x68
                 page_create_throttle+0x160
                 page_create_va+0x598
                 segkmem_page_create+0x97
                 segkmem_xalloc+0x13f
                 segkmem_alloc_vn+0x3b
                 segkmem_alloc+0x17
                 vmem_xalloc+0x629
                 vmem_alloc+0x190
                 kmem_slab_create+0x7c
                 kmem_slab_alloc+0x10b
                 kmem_cache_alloc+0x15b
                 zfs`zio_buf_alloc+0x59
                 zfs`zio_write_compress+0xca
                 zfs`zio_execute+0x7c
                 taskq_thread+0x2cd
                 thread_start+0xb

fffffe000242ac20 SLEEP    CV                      1
                 swtch+0x133
                 cv_wait+0x68
                 page_create_throttle+0x160
                 page_create_va+0x598
                 segkmem_page_create+0x97
                 segkmem_xalloc+0x13f
                 segkmem_alloc_vn+0x3b
                 segkmem_alloc+0x17
                 vmem_xalloc+0x629
                 vmem_alloc+0x190
                 kmem_slab_create+0x7c
                 kmem_slab_alloc+0x10b
                 kmem_cache_alloc+0x15b
                 zfs`zio_buf_alloc+0x59
                 zfs`abd_alloc_linear+0x77
                 zfs`abd_alloc_for_io+0xb
                 zfs`vdev_queue_aggregate+0x26d
                 zfs`vdev_queue_io_to_issue+0x3b
                 zfs`vdev_queue_io_done+0x7d
                 zfs`zio_vdev_io_done+0xc0
                 zfs`zio_execute+0x7c
                 taskq_thread+0x2cd
                 thread_start+0xb

fffffe0002a64c20 SLEEP    CV                      1
                 swtch+0x133
                 cv_wait+0x68
                 page_create_throttle+0x160
                 page_create_va+0x598
                 segkmem_page_create+0x97
                 segkmem_xalloc+0x13f
                 segkmem_alloc_vn+0x3b
                 segkmem_alloc+0x17
                 vmem_xalloc+0x629
                 vmem_alloc+0x190
                 kmem_slab_create+0x7c
                 kmem_slab_alloc+0x10b
                 kmem_cache_alloc+0x15b
                 zfs`zio_create+0x45
                 zfs`zio_vdev_child_io+0xda
                 zfs`vdev_mirror_io_start+0xb2
                 zfs`zio_vdev_io_start+0x1a2
                 zfs`zio_execute+0x7c
                 taskq_thread+0x2cd
                 thread_start+0xb

fffffe00023eec20 SLEEP    MUTEX                   1
                 swtch+0x133
                 turnstile_block+0x25b
                 mutex_vector_enter+0x358
                 zfs`vdev_queue_io+0x67
                 zfs`zio_vdev_io_start+0x1ea
                 zfs`zio_execute+0x7c
                 zfs`zio_nowait+0x2b
                 zfs`vdev_mirror_io_start+0xba
                 zfs`zio_vdev_io_start+0x1a2
                 zfs`zio_execute+0x7c
                 taskq_thread+0x2cd
                 thread_start+0xb

fffffe0002b3dc20 SLEEP    MUTEX                   1
                 swtch+0x133
                 turnstile_block+0x25b
                 mutex_vector_enter+0x358
                 zfs`vdev_queue_io+0x67
                 zfs`zio_vdev_io_start+0x1ea
                 zfs`zio_execute+0x7c
                 zfs`zio_nowait+0x2b
                 zfs`vdev_mirror_io_start+0xba
                 zfs`zio_vdev_io_start+0x1a2
                 zfs`zio_execute+0x7c
                 zfs`zio_nowait+0x2b
                 zfs`arc_read+0x849
                 zfs`dbuf_read_impl+0x3e6
                 zfs`dbuf_read+0xc9
                 zfs`dbuf_findbp+0x1a3
                 zfs`dbuf_hold_impl+0x1ab
                 zfs`dbuf_hold_level+0x32
                 zfs`dbuf_hold+0x18
                 zfs`dmu_buf_hold_array_by_dnode+0xe6
                 zfs`dmu_buf_hold_array+0x92
                 zfs`dmu_write+0x98
                 zfs`zvol_strategy+0x31b
                 bdev_strategy+0x64
                 specfs`spec_startio+0x8e
                 specfs`spec_pageio+0x3a
                 fop_pageio+0x5e
                 swap_putapage+0x1f4
                 swap_putpage+0x2d1
                 fop_putpage+0x56
                 pageout+0x278
                 thread_start+0xb

A tale of woe from which there is generally no return. At least now we will panic and generally get a dump to diagnose!

Actions #5

Updated by Joshua M. Clulow 11 months ago

Testing Notes (Supplemental)

Demonstrating that disabling the deadman (pageout_deadman_seconds = 0) has an effect. First, with the new defaults, we see that the pageout deadman is running and observing new pushcount values once a second:

root@oi0:~# dtrace -qn 'BEGIN { s = walltimestamp; } tick-1s { printf("%6d deadman_seconds %d pushcount %d pushcount_seen %d\n", (walltimestamp - s) /1000000, `pageout_deadman_seconds, `pageout_pushcount, `pageout_pushcount_seen); }'
   985 deadman_seconds 90 pushcount 12431 pushcount_seen 12431
  1985 deadman_seconds 90 pushcount 12431 pushcount_seen 12431
  2985 deadman_seconds 90 pushcount 12431 pushcount_seen 12431
  3985 deadman_seconds 90 pushcount 12431 pushcount_seen 12431
  4985 deadman_seconds 90 pushcount 12431 pushcount_seen 12431
  5985 deadman_seconds 90 pushcount 12431 pushcount_seen 12431
...
 38988 deadman_seconds 90 pushcount 12431 pushcount_seen 12431
 39985 deadman_seconds 90 pushcount 12431 pushcount_seen 12431
 40991 deadman_seconds 90 pushcount 12431 pushcount_seen 12431
 41990 deadman_seconds 90 pushcount 12431 pushcount_seen 12431
 42982 deadman_seconds 90 pushcount 12431 pushcount_seen 12431
 43981 deadman_seconds 90 pushcount 12431 pushcount_seen 12431
 44981 deadman_seconds 90 pushcount 14812 pushcount_seen 12992
 45981 deadman_seconds 90 pushcount 16071 pushcount_seen 16071
 46981 deadman_seconds 90 pushcount 16225 pushcount_seen 16141
Timeout, server 10.1.1.57 not responding.

At this point the system hung for 90 seconds and then panicked. Note that we are not tracing the deadman check itself (running from clock()) but rather observing its effect on the tracking variables.

Now, let's try again with the deadman disabled:

root@oi0:~# mdb -kwe pageout_deadman_seconds/W0
pageout_deadman_seconds:        0x5a            =       0x0

root@oi0:~# dtrace -qn 'BEGIN { s = walltimestamp; } tick-1s { printf("%6d deadman_seconds %d pushcount %d pushcount_seen %d\n", (walltimestamp - s) /1000000, `pageout_deadman_seconds, `pageout_pushcount, `pageout_pushcount_seen); }'
   936 deadman_seconds 0 pushcount 460 pushcount_seen 460
...
 31918 deadman_seconds 0 pushcount 460 pushcount_seen 460
 32918 deadman_seconds 0 pushcount 460 pushcount_seen 460
 33917 deadman_seconds 0 pushcount 460 pushcount_seen 460
 34917 deadman_seconds 0 pushcount 528 pushcount_seen 460
 35916 deadman_seconds 0 pushcount 569 pushcount_seen 460
 36916 deadman_seconds 0 pushcount 570 pushcount_seen 460
 37915 deadman_seconds 0 pushcount 570 pushcount_seen 460
 38914 deadman_seconds 0 pushcount 1960 pushcount_seen 460
 39914 deadman_seconds 0 pushcount 3779 pushcount_seen 460
 40913 deadman_seconds 0 pushcount 3779 pushcount_seen 460
 41913 deadman_seconds 0 pushcount 3779 pushcount_seen 460
 42912 deadman_seconds 0 pushcount 4390 pushcount_seen 460
 43912 deadman_seconds 0 pushcount 5302 pushcount_seen 460
 44911 deadman_seconds 0 pushcount 7207 pushcount_seen 460
 45910 deadman_seconds 0 pushcount 8592 pushcount_seen 460
 46910 deadman_seconds 0 pushcount 8592 pushcount_seen 460
 47909 deadman_seconds 0 pushcount 8592 pushcount_seen 460
 48909 deadman_seconds 0 pushcount 8592 pushcount_seen 460
 49908 deadman_seconds 0 pushcount 8592 pushcount_seen 460
 50908 deadman_seconds 0 pushcount 8848 pushcount_seen 460
 51907 deadman_seconds 0 pushcount 8848 pushcount_seen 460
 52907 deadman_seconds 0 pushcount 8848 pushcount_seen 460

In this case, the seen count is not updated, which means the early return in pageout_deadman() is working and the deadman is having no effect. Next, checking that it can be disabled in a persistent fashion:

root@oi0:~# echo 'set pageout_deadman_seconds = 0' > /etc/system.d/no_pageout_deadman
root@oi0:~# reboot
updating /platform/i86pc/amd64/boot_archive (CPIO)
...
$ ssh oi0
$ pfexec mdb -ke 'pageout_deadman_seconds/D'
pageout_deadman_seconds:
pageout_deadman_seconds:        0
Actions #6

Updated by Electric Monk 10 months ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 727737b40e05e03cb1b298a96f67258b116ba990

commit  727737b40e05e03cb1b298a96f67258b116ba990
Author: Joshua M. Clulow <josh@sysmgr.org>
Date:   2020-09-23T22:22:51.000Z

    13082 pageout needs a deadman
    Reviewed by: Andy Fiddaman <andy@omniosce.org>
    Reviewed by: Jason King <jason.brian.king@gmail.com>
    Reviewed by: Peter Tribble <peter.tribble@gmail.com>
    Approved by: Robert Mustacchi <rm@fingolfin.org>
