assertion failed: ht->ht_valid_cnt >= 0, file: ../../i86pc/vm/htable.c, line: 1204
A few months ago we started occasionally seeing this assertion fire during our nightly zfs-test runs (on a debug build, obviously).
It happens about once every couple of weeks.
hati_page_unmap+0xe5(ffffff0090368b00, ffffff749d1e3b50, 13e)
hati_pageunload+0xb6(ffffff0090368b00, 0, 1)
pvn_vplist_dirty+0x29e(ffffff5bd7297500, 0, fffffffff78f7550, 110000, 0)
zfs_freesp+0xc5(ffffff6bb09ce228, 0, 0, 180, 1)
zfs_create+0x586(ffffff473e245300, ffffff339e4ebccf, ffffff00f60adbc0, 0, 180, ffffff00f60adc78)
fop_create+0xc7(ffffff473e245300, ffffff339e4ebccf, ffffff00f60adbc0, 0, 180, ffffff00f60adc78)
vn_createat+0x47a(8046f2a, 0, ffffff00f60adbc0, 0, 180, ffffff00f60adc78)
vn_openat+0x1be(8046f2a, 0, 2303, 380, ffffff00f60addf0, 0)
copen+0x220(ffd19553, 8046f2a, 2303, 380)
openat64+0x2a(ffd19553, 8046f2a, 302, 380)
open64+0x25(8046f2a, 302, 380)
In several of the panics, <htable addr>::whatthread lists the panic thread PLUS ONE OTHER, which is in proc_exit. The two threads do not belong to the same process, which threw me a bit for a while.
mdb: error reading thread name: no mapping for address
stack pointer for thread ffffff5bb15b8080 (ksh/1): ffffff00f751bcb0
ffffff00f751bd20 cpu_update_pct+0x44(ffffff00f66f7c40, 4bb1843e3365)
ffffff00f751be70 proc_exit+0xb6b(1, 0)
ffffff00f751be90 exit+0x15(1, 0)
Here's what I think is happening:
The thread doing zfs_trunc is unloading the memory mappings for the pages associated with the file (vnode) being truncated. The list of mappings on a page includes all mappings for that page, so this thread is accessing entries in other processes' HATs. (In this test there are many processes all using the same file). Using dtrace, I have proven this is indeed happening.
The exiting process destroys its HAT. There is no locking on that path. I assume the theory was that this is done by the last thread in the process and the HAT is process-specific, so it's okay to just destroy it.
However, the zfs_trunc thread has a hold on an htable in the HAT of the exiting process (while it frees up a mapping therein). The proc_exit thread destroys its HAT and hence this htable, so when the zfs_trunc thread tries to release its hold on the htable it finds that the htable has been freed... and the ASSERT fires.
It seems that this has been around, undiscovered, for a long time. However, with the "if it can happen it will happen" mindset, we should investigate ways to address it.
Updated by Joyce McIntosh about 1 year ago
From code observation, here is my theory on why the existing locking is not quite doing its job in this scenario:
During the zfs_trunc, the hments of the associated pages are being removed from the htable. A hold is placed on the htable while this is being done, so that it won't be freed. This hati_pageunload() work is done within x86_hm_enter/exit.
During the proc_exit, the as_free() blocks in x86_hm_enter() during SEGOP_UNMAP - before the call to hat_free_end() to free the HAT and its htables. So far, so good.
However, the zfs_trunc thread (hati_pageunload) exits the lock before releasing its hold on the htable. In the window between exiting the lock and doing the htable_release(), the proc_exit thread gets unblocked (by the unlock) and proceeds to call hat_free_end(), which destroys the htable (regardless of holds on the htable). Thus, when the zfs_trunc thread tries to do its htable_release(), the htable has already been freed and we panic.
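The ordering described above can be replayed in a small user-space model. This is only a sketch of the theory, not kernel code: every toy_* name is hypothetical, the "lock" is a plain flag, and the two threads are collapsed into one deterministic sequence so the interleaving is explicit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an htable; not the real htable_t layout. */
typedef struct toy_htable {
    int  holds;  /* outstanding holds on the table */
    bool freed;  /* set instead of calling free(), so the model can inspect it */
} toy_htable_t;

/* Hypothetical stand-in for the per-page mapping list lock. */
typedef struct toy_lock {
    bool held;
} toy_lock_t;

static void toy_hm_enter(toy_lock_t *l) { assert(!l->held); l->held = true; }
static void toy_hm_exit(toy_lock_t *l)  { assert(l->held);  l->held = false; }

/* Teardown as theorised: destroys the table without looking at holds. */
static void toy_hat_free_end(toy_htable_t *ht) { ht->freed = true; }

/* Returns true if the release touched a table that was already freed. */
static bool toy_htable_release(toy_htable_t *ht)
{
    if (ht->freed)
        return true;
    ht->holds--;
    return false;
}

/*
 * Replay the unmap sequence. If teardown_in_window is true, the exit
 * thread runs in the gap between dropping the lock and releasing the hold.
 */
bool toy_replay_window(bool teardown_in_window)
{
    toy_htable_t ht = { 0, false };
    toy_lock_t lock = { false };

    ht.holds++;                 /* unmap thread takes its hold */
    toy_hm_enter(&lock);        /* mapping removed under the lock */
    toy_hm_exit(&lock);         /* lock dropped BEFORE the release */

    if (teardown_in_window)
        toy_hat_free_end(&ht);  /* exit thread, unblocked by the unlock */

    return toy_htable_release(&ht);
}
```

Calling toy_replay_window(true) reports a release against a freed table, which is the condition the ASSERT catches; toy_replay_window(false) does not.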
To validate this theory I opened up the timing window between the x86_hm_exit() and htable_release(), using dtrace_chill(). With a little bit of a tweak to the test case, and the temporary addition of the KMF_AUDIT flag on the htable cache, I am able to reproduce the panic and prove the theory.
hati_page_unmap+0xeb(ffffd000d9032a00, ffffd0576bff2800, 7f)
hati_pageunload+0xb6(ffffd000d9032a00, 0, 1)
pvn_vplist_dirty+0x29e(ffffd06a69470400, 0, fffffffff78f7690, 110000, 0)
zfs_freesp+0xc5(ffffd06bf3c8baa8, 0, 0, 180, 1)
zfs_create+0x586(ffffd0338bf7c900, ffffd0340e8d4ace, ffffd000faa50bc0, 0, 180, ffffd000faa50c78)
fop_create+0xc7(ffffd0338bf7c900, ffffd0340e8d4ace, ffffd000faa50bc0, 0, 180, ffffd000faa50c78)
vn_createat+0x47a(8046e0a, 0, ffffd000faa50bc0, 0, 180, ffffd000faa50c78)
vn_openat+0x1be(8046e0a, 0, 2303, 380, ffffd000faa50df0, 0)
copen+0x220(ffd19553, 8046e0a, 2303, 380)
openat64+0x2a(ffd19553, 8046e0a, 302, 380)
open64+0x25(8046e0a, 302, 380)
ffffd0576bff2800 is freed from htable_t:
ADDR BUFADDR TIMESTAMP THREAD
CACHE LASTLOG CONTENTS
ffffd03def4546e8 ffffd0576bff2800 98b15f2347fbe ffffd033e590a400
ffffd032e7e49008 ffffd0322812d900 ffffd032c6183ba8
Proper use of the htable reference count on the proc_exit path would resolve this.
Also looking into whether simply holding the lock across the htable_release() is feasible. The comment above the lock exit makes me question the viability of this approach:
/* drop the mapping list lock so that we might free the hment and htable. */
Updated by Joyce McIntosh about 1 year ago
Investigation found that the x86_hm_exit() must be done before the htable_release() to avoid a potential deadlock with HTABLE_ENTER.
Another alternative (low risk but somewhat bandaid) would be to use a hat_flag that hat_free_start() would block on, similar to the HAT_VICTIM/HAT_FREEING mechanism.
Updated by Joyce McIntosh 11 months ago
From further triage and testing, I believe that the timing window is in fact larger than this. The x86_hm_enter() will only block the process exit thread from destroying the htable if the PTE/hment has not yet been deleted: in that case the process exit thread finds it in the page table and calls x86_hm_enter() to take the lock and remove it. If hati_page_unmap() has already deleted the PTE/hment, then the process exit thread will not find it, will not call x86_hm_enter(), and may proceed with the HAT teardown.
Thus, a bandaid solution to synchronize these operations would need to be in force across the whole function (from before the PTE is removed), not only across the x86_hm_exit / htable_release.
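The reasoning about how far the synchronization must extend can be written down as a small truth-table model. This is a sketch only: the enum names and both functions are hypothetical, and the model encodes just the behaviour described in this comment (the exit thread is held off by the mapping list lock only while it can still find the PTE).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical guard placements under discussion. */
typedef enum {
    GUARD_NONE,   /* existing code, no extra synchronization */
    GUARD_TAIL,   /* guard only across x86_hm_exit / htable_release */
    GUARD_WHOLE   /* guard taken before the PTE is removed */
} guard_t;

/* Points in the unmap thread at which the exit thread might arrive. */
typedef enum {
    AT_LOCKED_PTE_PRESENT,  /* lock held, PTE still in the page table */
    AT_LOCKED_PTE_GONE,     /* lock held, PTE/hment already removed */
    AT_UNLOCKED_HOLDING,    /* lock dropped, htable hold not yet released */
    POINT_COUNT
} point_t;

/* True if the exit thread would be held off at this point. */
bool teardown_blocked(guard_t g, point_t p)
{
    /* Existing behaviour: the exit thread blocks only if it finds the PTE. */
    if (p == AT_LOCKED_PTE_PRESENT)
        return true;
    if (g == GUARD_WHOLE)
        return true;
    /* A tail-only guard is not yet set while the lock is still held. */
    return (g == GUARD_TAIL && p == AT_UNLOCKED_HOLDING);
}

/* A guard is sufficient only if teardown is held off at every point. */
bool guard_is_sufficient(guard_t g)
{
    for (int p = 0; p < POINT_COUNT; p++) {
        if (!teardown_blocked(g, (point_t)p))
            return false;
    }
    return true;
}
```

In this model only GUARD_WHOLE is sufficient; GUARD_TAIL still lets teardown through at AT_LOCKED_PTE_GONE, which is the larger window described above.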
Proper use of the htable reference count on the proc_exit path would be the better approach.
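One possible shape for a teardown that honours the hold count is sketched below. It is a single-threaded user-space model under stated assumptions, not the illumos implementation: the rc_* names are hypothetical, and in the kernel the counter and flag would need to be updated atomically or under a lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical refcounted table; not the real htable_t. */
typedef struct rc_htable {
    int  holds;  /* outstanding holds */
    bool dying;  /* teardown has been requested */
    bool freed;  /* set instead of calling free(), so the model can inspect it */
} rc_htable_t;

static void rc_hold(rc_htable_t *ht)
{
    assert(!ht->freed);
    ht->holds++;
}

/*
 * Teardown: free immediately only if nobody holds the table,
 * otherwise mark it dying and defer. Returns true if freed now.
 */
bool rc_teardown(rc_htable_t *ht)
{
    ht->dying = true;
    if (ht->holds == 0)
        ht->freed = true;
    return ht->freed;
}

/*
 * Release: the last holder of a dying table performs the deferred free.
 * Returns true if this call freed the table.
 */
bool rc_release(rc_htable_t *ht)
{
    assert(!ht->freed);      /* never release into freed memory */
    assert(ht->holds > 0);
    ht->holds--;
    if (ht->dying && ht->holds == 0) {
        ht->freed = true;
        return true;
    }
    return false;
}

/* Replay the problem ordering: hold, teardown arrives, then release. */
bool rc_replay(void)
{
    rc_htable_t ht = { 0, false, false };

    rc_hold(&ht);                           /* unmap thread holds */
    bool freed_early = rc_teardown(&ht);    /* exit thread arrives */
    bool freed_late = rc_release(&ht);      /* unmap thread releases */

    return (!freed_early && freed_late && ht.freed);
}
```

With this ordering the table outlives the hold and is freed exactly once, by the final release. An alternative with the same effect would be for teardown to wait until the hold count drains to zero.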