Bug #1667


pkcs11 may deadlock when multi-threaded consumers fork

Added by Rich Lowe over 11 years ago. Updated almost 9 years ago.

lib - userland libraries
Start date:
Due date:
% Done:


Estimated time:
Gerrit CR:
External Bug:


Several times, the Jenkins instance I used to provide putback-triggered
illumos builds and packages has inexplicably hung while polling for new SCM
changes. A message is produced saying that no polling activity is occuring,
and builds cease.

I've, twice, investigated why and found the following things to be true.

  • Jenkins has forked and has a child process
  • This child has not called exec (though presumably wishes it had)
  • This child has existed since approximately the last SCM poll + one interval

The parent process is still, unfortunately, live, so its state as compared to
the child is represented only by the child, to the extent the child has not
modified it post-fork.

The state of the child looks like this:

b38e8008`__lwp_park+0x19(b5154a40, 0, b5eca148, 0)
b38e8048`mutex_lock_impl+0x163(b5eca148, 0, b5e08143, b5ea0018, b5eb0e30, 0)
b38e8088`pkcs11_close_common+0x1c(b5eca110, b5eca148, 1, b5eb677c)
b38e80a8`pkcs11_close_urandom+0x29(0, b5e74000, b38e80d8, b5e7440c, b5e74420, 0)
b38e80f8`softtoken_fini+0x3b(feffb8f4, 0, b5ea0718, b38e814c, fefcdbdf, 0)
b38e810c`_fini+0x25(0, 0, feffb388, feffb2e4, feffb168, b5f10810)
b38e814c`call_fini+0xff(feffb168, b4e60768, b5f10810, 0)
b38e81bc`remove_hdl+0x78c(b5ea0bc8, b5f10810, 0, fefd265a)
b38e81dc`dlclose_core+0xbf(b5ea0bc8, b5f10810, feffb168, fefd2756)
b38e821c`dlclose_intn+0x2a(b5ea0bc8, b5f10810, 0, fefd27c8)
b38e825c`dlclose_check+0x97(b5ea0bc8, b5f10810, 0, fefd288d)
b38e829c`dlclose+0x46(b5ea0bc8, b5efa000, b38e82cc, b5ee5bc2)
b38e82dc`pkcs11_slottable_delete+0xbf(fedc07c0, b5efa000, b38e830c, b5ee182e, 0, b5e74000)
b38e830c`pkcs11_fini+0x46(fedc0780, fedc0080, b38e832c, fee2320d, 0, fef20000)
b38e831c`pkcs11_fork_child_fini+0xb(0, fef20000, b38e835c, fee8c791, fef20000, fef24580)
b38e832c`_postfork_child_handler+0x2f(fef20000, fef24580, fef24540, fef244c0, 0, 93fcd18)
b38e835c`forkx+0x199(0, 0, fee3a642, fee8c844, fe262000)
b38e837c`fork+0x1d(93fcc00, b87a61a8, b87a61b0, e6a0ce24, 15, e6a0cd8c)
b38e8480, b38e847c, b38e8478, 7, b38e8470)
b38e8448 0xfb00a152(0, fb0080f9, e6a0ce70, e6a0ce60, e6a0ce50, 1)
b38e84a8 0xfb002f69(1, e6a0ce18, f, e6a0ca48, 7, e6a0c978)
e6a0ccd0 0xfb06b7f8(0, 0, 0, e6a0ce50, e6a0ce60, e6a0ce70)
00000105 0xb87a6f30()

We just forked, and we're running the atfork handles for libpkcs11, one of
which deletes the slottable, which dlclose's the subordinate modules,
including pkcs11_softtoken, that closes the urandom device via
pkcs11_close_urandom, and tries to take the libcryptoutil &urandom_mutex,
which is already held.

It is already held because, at the moment the parent (main jenkins) process
forked it held &urandom_mutex, and thus it is also held in the child process.
However, the child does not contain the thread which took the mutex, so it
will never be released, and the child is deadlocked under the (child's side of)


1667.c (898 Bytes) 1667.c Dan McDonald, 2014-06-05 03:55 PM
1667.c (1.68 KB) 1667.c Dan McDonald, 2014-06-05 07:11 PM

Related issues

Has duplicate illumos gate - Bug #4909: fork during pkcs11 operation can deadlock child processClosed2014-06-04

Actions #1

Updated by Rich Lowe over 11 years ago

The simple case of this can be fixed with the use of pthread_atfork(), in precisely the manner described in its manual page. However this case cannot be. This is because in this case we are actually within an atfork child handler (that of libpkcs11) that will run before the obvious one in libcryptoutil, so the normal method of fixing this will, instead, guarantee that the bug is hit.

A distasteful option would be to add a routine to libcryptoutil to drop all of that libraries locks, call it from the libpkcs11 fork handler (as well as libcryptoutil's), which should solve the problem, but is obviously nasty.

A better solution would be to rearchitect the way libpkcs11 tears itself down to prevent this whole class of bug (which, looking through the history, has been a constant annoyance).

I'm still undecided whether to apply the tasty solution as a bandaid, and file a bug for the real fix.

Actions #2

Updated by Dan McDonald almost 9 years ago

NOTE: Make sure you see #4909 for a DTrace script to detect these.

The man page refers to the PKCS#11 v2.20 spec (wrong section as well). This version of the spec is unspecific about multithreaded fork behavior.

Actions #3

Updated by Dan McDonald almost 9 years ago

The current method of libcryptoutil's pkcs11_*random_*() interfaces arrived with this commit:

commit 7b79d84636ec82b45f00c982cf6810db81852d17
Author: Dina K Nimeh <>
Date: Fri Mar 20 11:56:57 2009 -0700

6666204 meta slot opens and closes /dev/urandom needlessly for every read
6722460 finish moving /dev/random and /dev/urandom seeding and usage to libc

It introduced public interfaces to libcryptoutil for: pkcs11_close_urandom(), pkcs11_close_random(), and pkcs11_close_urandom_seed().

These are present so pkcs11_softtoken()'s finalize_common() (all three) and meta_Finalize() (both urandom ones) can close the file descriptors as part of a PKCS#11 C_Finalize() operation, or as part of a pthread_atfork() handler in the case of pkcs11_softtoken().

The BETTER approach is to remove these calls from outside libcryptoutil, and add pthread_atfork() handlers for libcryptoutil. I believe (and testing will bear this out) that the way libcryptoutil handles itself, the open file descriptors for /dev/{,u}random will still be there, for use from other, non-softtoken callers (see, for example, libsmbns). The public functions in libcryptoutil are resistant against file descriptor leakage and with the very mutexes on which we're locking, let the public interfaces be thread-safe.

Actions #4

Updated by Dan McDonald almost 9 years ago

Found a program that can reproduce the problem. It requires user interaction (press RETURN to fork). I used 128 threads from the command line to make it work (and wait a bit as well), but it's worked well enough three times in a row. Link with -lcryptoutil and -lpkcs11

omnios-stable0% ./a.out 128
Running with 128 random-sampling threads.
Press RETURN to start a fork:
Child pid == 25726
omnios-stable0% pstack 25726 |head
25726: ./a.out 128
fee92ae9 lwp_park (0, 0, 0)
fee8ca55 mutex_lock_impl (fef79120, 0, 68, 518941d, 5b, f6dc3f84) + 291
fee8d36f mutex_lock (fef79120) + 19
fef6531d pkcs11_close_common (fef7911c, fef79120, 0, f6e31e00) + 21
fef653c2 pkcs11_close_urandom (f6e32260, f6d40648, 80479c8, f6dcc779, f6e2f3c4, 0) + 26
f6dcc80c finalize_common (1) + a1
f6dcc923 softtoken_fini (feffb0a4, 8047a04, fefcec06, feffb0a4, fedb0698, 8047a44) + 4b
f6e123a0 _fini (feffc8c0, 0, fedb0698, f, f6d40680, fee0d6d0) + 20
fefd213b call_fini (feffc8c0, f6d40648, fef501a8, 0) + b3
omnios-stable0% kill -9 25726

Once the nightly's built, I'll try with this fix in place.

Actions #5

Updated by Dan McDonald almost 9 years ago

New version of 1667.c is attached. If you run with argv2 set, it will have a locked child, but NOT on this bug. See #4912 for that bug. No file descriptors leak, however.

Actions #6

Updated by Electric Monk almost 9 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 05d57413471eaaa425913a06edc2ab33ad9b05bc

commit  05d57413471eaaa425913a06edc2ab33ad9b05bc
Author: Dan McDonald <>
Date:   2014-06-06T18:58:03.000Z

    1667 pkcs11 may deadlock when multi-threaded consumers fork
    Reviewed by: Garrett D'Amore <>
    Reviewed by: Saso Kiselkov <>
    Approved by: Gordon Ross <>


Also available in: Atom PDF