panic while reaping htable_cache
The following panic was observed (quite rare, under heavy load, ):
message mutex_enter: bad mutex, lp=ffffff2328dd64c0 owner=deadbeefdeadbee8 thread=ffffff00f4017c60
This is with kmem_flags=0xf, so it appears that this is "use after free" situation, and we are able to see where the buffer is freed from with ::bufctl -v:
This appears to be a process address space teardown code path. After reviewing the code, the apparent design intent is as follows: in order to prevent the race between the teardown and reap, the hat_t's are atomically checked for HAT_VICTIM and marked HAT_FREEING in hat_free_start(). Then the reap code skips the HAT_FREEING hat_t's when victimizing.
Yet it also appears that we still see the race. Looking at the htable_steal(), we find that the inner for(;;) loop seems to be re-setting the HAT_VICTIM flag after a collection of htable_t's was taken off the hat_t list but before they are actually freed.
This seems to allow for a situation where the teardown thread waiting for the hat_t's HAT_VICTIM flag to be cleared in hat_free_start(), will come out and set the HAT_FREEING flag at some later point in time, and the hat_free_end() will teardown/free that hat_t. Yet the reap code path will not check for the HAT_FREEING (or will fail to notice the flag setting - depending on the scheduling - the checks are not done under the lock), so the htable_t's taken off the (freed) hat_t list in htable_steal() will still be reaped.
Specifically, in our case, if the teardown is completed first (the buffer is freed and reused), and then the htable_free()s are called in the reap code path, the htable_free()s will be looking at garbage when checking for the HAT_FREEING flag in the first if(...!(hat->hat_flag & HAT_FREEING)...) statement of the HAT_FREEING. As a consequence, hat_enter() will be called, and it will cause the above panic if kmem_flags=0xf are set or the mutex is trashed in some other way.
To summarize, the cause of the panic appears to be stale ht_hat pointer in the htable_t at the time of calling htable_free(). The ht_hat points to freed space (arbitrary garbage). The reason for this is a race condition between teardown and reap control flows, which is not being properly addressed by the combination of HAT_VICTIM and HAT_FREEING flags.
The proposed solution is to disassociate htable_t's to be reaped from the hat_t object at the time the former a gathered for reaping. This happens in htable_steal() function. The latter is also called from htable_alloc(), and in this case, we do not reap/disassociate from the hat_t. Therefore, an additional parameter that conveys the context is needed for htable_steal() to do the right thing in both cases: in case of reaping context, we zero out hat_t pointer, and we perform platform specific ptable_free() work (as this was previously done in the original code).
Later, in htable_free() we check for ht_hat pointer in htable and perfrom the needed actions, while guaranteeing race-free execution in reap context - there is no dependency on ht_hat, so we do not need to worry about it going away.
Updated by Boris Protopopov over 7 years ago
One other detail that seemed amiss in the code was that it was not obvious whether the htable_dont_cache flag was set when htable_free() was called from htable_reap(). However, it would seem that if the flag was not set, then htable_free() could put the htable back on the hat's cached_list. This would have defeated the purpose of reaping - the htable would not be freed but would still be cached.
The proposed fix is to set/increment the dont_cache flag in htable_reap() before calling htable_steal() and to decrement the flag only after htable_free() loop is done, in order to assure proper freeing.
Updated by Electric Monk over 6 years ago
- Status changed from New to Closed
- % Done changed from 0 to 100
commit b59c4a48daf5a1863ecac763711b497b2f8321e4 Author: Boris Protopopov <firstname.lastname@example.org> Date: 2014-11-05T22:53:03.000Z 4110 panic while reaping htable_cache Reviewed by: Gordon Ross <Gordon.Ross@nexenta.com> Reviewed by: Ilya Usvyatsky <Ilya.Usvyatsky@nexenta.com> Reviewed by: Thomas Keiser <email@example.com> Reviewed by: Josef 'Jeff' Sipek <firstname.lastname@example.org> Approved by: Garrett D'Amore <email@example.com>