Bug #1667
closedpkcs11 may deadlock when multi-threaded consumers fork
100%
Description
Several times, the Jenkins instance I used to provide putback-triggered
illumos builds and packages has inexplicably hung while polling for new SCM
changes. A message is produced saying that no polling activity is occuring,
and builds cease.
I've, twice, investigated why and found the following things to be true.
- Jenkins has forked and has a child process
- This child has not called exec (though presumably wishes it had)
- This child has existed since approximately the last SCM poll + one interval
The parent process is still, unfortunately, live, so its state as compared to
the child is represented only by the child, to the extent the child has not
modified it post-fork.
The state of the child looks like this:
b38e8008 libc_hwcap1.so.1`__lwp_park+0x19(b5154a40, 0, b5eca148, 0) b38e8048 libc_hwcap1.so.1`mutex_lock_impl+0x163(b5eca148, 0, b5e08143, b5ea0018, b5eb0e30, 0) b38e8068 libc_hwcap1.so.1`mutex_lock+0x10(b5eca148) b38e8088 libcryptoutil.so.1`pkcs11_close_common+0x1c(b5eca110, b5eca148, 1, b5eb677c) b38e80a8 libcryptoutil.so.1`pkcs11_close_urandom+0x29(0, b5e74000, b38e80d8, b5e7440c, b5e74420, 0) b38e80d8 pkcs11_softtoken.so.1`finalize_common+0xaa(1) b38e80f8 pkcs11_softtoken.so.1`softtoken_fini+0x3b(feffb8f4, 0, b5ea0718, b38e814c, fefcdbdf, 0) b38e810c pkcs11_softtoken.so.1`_fini+0x25(0, 0, feffb388, feffb2e4, feffb168, b5f10810) b38e814c ld.so.1`call_fini+0xff(feffb168, b4e60768, b5f10810, 0) b38e81bc ld.so.1`remove_hdl+0x78c(b5ea0bc8, b5f10810, 0, fefd265a) b38e81dc ld.so.1`dlclose_core+0xbf(b5ea0bc8, b5f10810, feffb168, fefd2756) b38e821c ld.so.1`dlclose_intn+0x2a(b5ea0bc8, b5f10810, 0, fefd27c8) b38e825c ld.so.1`dlclose_check+0x97(b5ea0bc8, b5f10810, 0, fefd288d) b38e829c ld.so.1`dlclose+0x46(b5ea0bc8, b5efa000, b38e82cc, b5ee5bc2) b38e82dc libpkcs11.so.1`pkcs11_slottable_delete+0xbf(fedc07c0, b5efa000, b38e830c, b5ee182e, 0, b5e74000) b38e82ec libpkcs11.so.1`finalize_common+0xe2(0) b38e830c libpkcs11.so.1`pkcs11_fini+0x46(fedc0780, fedc0080, b38e832c, fee2320d, 0, fef20000) b38e831c libpkcs11.so.1`pkcs11_fork_child_fini+0xb(0, fef20000, b38e835c, fee8c791, fef20000, fef24580) b38e832c libc_hwcap1.so.1`_postfork_child_handler+0x2f(fef20000, fef24580, fef24540, fef244c0, 0, 93fcd18) b38e835c libc_hwcap1.so.1`forkx+0x199(0, 0, fee3a642, fee8c844, fe262000) b38e837c libc_hwcap1.so.1`fork+0x1d(93fcc00, b87a61a8, b87a61b0, e6a0ce24, 15, e6a0cd8c) b38e83e8 libjava.so`Java_java_lang_UNIXProcess_forkAndExec+0x395(93fcd18, b38e8480, b38e847c, b38e8478, 7, b38e8470) b38e8448 0xfb00a152(0, fb0080f9, e6a0ce70, e6a0ce60, e6a0ce50, 1) b38e84a8 0xfb002f69(1, e6a0ce18, f, e6a0ca48, 7, e6a0c978) e6a0ccd0 0xfb06b7f8(0, 0, 0, e6a0ce50, e6a0ce60, e6a0ce70) 00000105 0xb87a6f30()
We just forked, and we're running the atfork handles for libpkcs11, one of
which deletes the slottable, which dlclose's the subordinate modules,
including pkcs11_softtoken, that closes the urandom device via
pkcs11_close_urandom, and tries to take the libcryptoutil &urandom_mutex,
which is already held.
It is already held because, at the moment the parent (main jenkins) process
forked it held &urandom_mutex, and thus it is also held in the child process.
However, the child does not contain the thread which took the mutex, so it
will never be released, and the child is deadlocked under the (child's side of)
fork(2).
Files
Related issues
Updated by Rich Lowe over 11 years ago
The simple case of this can be fixed with the use of pthread_atfork(), in precisely the manner described in its manual page. However this case cannot be. This is because in this case we are actually within an atfork child handler (that of libpkcs11) that will run before the obvious one in libcryptoutil, so the normal method of fixing this will, instead, guarantee that the bug is hit.
A distasteful option would be to add a routine to libcryptoutil to drop all of that libraries locks, call it from the libpkcs11 fork handler (as well as libcryptoutil's), which should solve the problem, but is obviously nasty.
A better solution would be to rearchitect the way libpkcs11 tears itself down to prevent this whole class of bug (which, looking through the history, has been a constant annoyance).
I'm still undecided whether to apply the tasty solution as a bandaid, and file a bug for the real fix.
Updated by Dan McDonald almost 9 years ago
NOTE: Make sure you see #4909 for a DTrace script to detect these.
The man page refers to the PKCS#11 v2.20 spec (wrong section as well). This version of the spec is unspecific about multithreaded fork behavior.
Updated by Dan McDonald almost 9 years ago
The current method of libcryptoutil's pkcs11_*random_*() interfaces arrived with this commit:
commit 7b79d84636ec82b45f00c982cf6810db81852d17
Author: Dina K Nimeh <Dina.Nimeh@Sun.COM>
Date: Fri Mar 20 11:56:57 2009 -0700
6666204 meta slot opens and closes /dev/urandom needlessly for every read
6722460 finish moving /dev/random and /dev/urandom seeding and usage to libc
It introduced public interfaces to libcryptoutil for: pkcs11_close_urandom(), pkcs11_close_random(), and pkcs11_close_urandom_seed().
These are present so pkcs11_softtoken()'s finalize_common() (all three) and meta_Finalize() (both urandom ones) can close the file descriptors as part of a PKCS#11 C_Finalize() operation, or as part of a pthread_atfork() handler in the case of pkcs11_softtoken().
The BETTER approach is to remove these calls from outside libcryptoutil, and add pthread_atfork() handlers for libcryptoutil. I believe (and testing will bear this out) that the way libcryptoutil handles itself, the open file descriptors for /dev/{,u}random will still be there, for use from other, non-softtoken callers (see, for example, libsmbns). The public functions in libcryptoutil are resistant against file descriptor leakage and with the very mutexes on which we're locking, let the public interfaces be thread-safe.
Updated by Dan McDonald almost 9 years ago
Found a program that can reproduce the problem. It requires user interaction (press RETURN to fork). I used 128 threads from the command line to make it work (and wait a bit as well), but it's worked well enough three times in a row. Link with -lcryptoutil and -lpkcs11
omnios-stable0% ./a.out 128
Running with 128 random-sampling threads.
Press RETURN to start a fork:
Child pid == 25726
omnios-stable0% pstack 25726 |head
25726: ./a.out 128
fee92ae9 lwp_park (0, 0, 0)
fee8ca55 mutex_lock_impl (fef79120, 0, 68, 518941d, 5b, f6dc3f84) + 291
fee8d36f mutex_lock (fef79120) + 19
fef6531d pkcs11_close_common (fef7911c, fef79120, 0, f6e31e00) + 21
fef653c2 pkcs11_close_urandom (f6e32260, f6d40648, 80479c8, f6dcc779, f6e2f3c4, 0) + 26
f6dcc80c finalize_common (1) + a1
f6dcc923 softtoken_fini (feffb0a4, 8047a04, fefcec06, feffb0a4, fedb0698, 8047a44) + 4b
f6e123a0 _fini (feffc8c0, 0, fedb0698, f, f6d40680, fee0d6d0) + 20
fefd213b call_fini (feffc8c0, f6d40648, fef501a8, 0) + b3
omnios-stable0% kill -9 25726
Once the nightly's built, I'll try with this fix in place.
Updated by Dan McDonald almost 9 years ago
Updated by Electric Monk almost 9 years ago
- Status changed from New to Closed
- % Done changed from 0 to 100
git commit 05d57413471eaaa425913a06edc2ab33ad9b05bc
commit 05d57413471eaaa425913a06edc2ab33ad9b05bc Author: Dan McDonald <danmcd@omniti.com> Date: 2014-06-06T18:58:03.000Z 1667 pkcs11 may deadlock when multi-threaded consumers fork Reviewed by: Garrett D'Amore <garrett@damore.org> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Approved by: Gordon Ross <gordon.ross@nexenta.com>