Actions
Bug #4909
closedfork during pkcs11 operation can deadlock child process
Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Start date:
2014-06-04
Due date:
% Done:
0%
Estimated time:
Difficulty:
Medium
Tags:
needs-triage
Gerrit CR:
External Bug:
Description
Repeatedly see this on an OmniOS r151006 system, running Oracle Java 7 and Jenkins. After some period of run time, a Jenkins SCM polling operation will get stuck and be unrecoverable until the whole app is shut down.
The stuck process has a stack like:
44001: java -Dsvnkit.http.sslProtocols=SSLv3 -Xmx512m -jar /jenkins/app/jenki feea0f69 lwp_park (0, 0, 0) fee9aed5 mutex_lock_impl (d3919120, 0, 7601, 518941d, 5b, d3883f84) + 291 fee9b7ef mutex_lock (d3919120) + 19 d390531d pkcs11_close_common (d391911c, d3919120, 0, d38f1e00) + 21 d39053c2 pkcs11_close_urandom (d38f2260, d03a0fb0, d1d83d1c, d388c779, d38ef3c4, 0) + 26 d388c80c finalize_common (1) + a1 d388c923 softtoken_fini (feffb0a4, d1d83d88, fefd213b, feffc8c0, 0, d3920818) + 4b d38d238a _fini (feffc8c0, 0, d3920818, f, d3980660, fee1d23c) + 1a fefd213b call_fini (feffc8c0, d03a0fb0, d3980c18, 0) + b3 fefd8727 remove_hdl (d3920cc0, d3980c18, 0, fefd4225, fef14000, d2adea40) + 744 fefd42cf dlclose_core (d3920cc0, d3980c18, feffc8c0, fef181c0, 0, fefceb85) + b8 fefd431f dlclose_intn (d3920cc0, d3980c18, fefd4391, d3920cc0) + 25 fefd4405 dlclose_check (d3920cc0, d3980c18, d1d83ef8, d395c000, 8686ca0, 8686ca8) + 92 fefd444d dlclose (d3920cc0, fedb0080, d1d83f28, 0) + 3d d39443ee pkcs11_slottable_delete (d395c45c, 0, 18, d395c000, fd960680, fedb0080) + b5 d394034d finalize_common (0, fedb0080, d1d83f88, d388c761, d38ef3c4, fef14000) + 10f d39403ac pkcs11_fini (fef14000, fd960640, d1d83f98, fee1d524, d2adea40, fef181c0) + 57 d39403cd pkcs11_fork_child_fini (d2adea40, fef181c0, d1d83fd8, fee942c9, fef18240, d2adebcc) + b fee1d524 _postfork_child_handler (fef18240, d2adebcc, fed933b0, fef14000, fef14824, fef18240) + 23 fee942c9 forkx (0, 85593b0, c, 700, 0) + 19c fee94373 fork (d2adea40, fef14824, d1d84048, 0, fef14824, b0487e8) + 1e fd85d75b startChild (93cb130, d1d8418c, b0487e8, b048b68) + 2b fd85de66 Java_java_lang_UNIXProcess_forkAndExec (93cb130, d1d8418c, 1, d1d84184, d1d84180, d1d8417c) + 2da fa60b158 * java/lang/UNIXProcess.forkAndExec(I[B[B[BI[BI[B[IZ)I+0 fad50e0c * *java/lang/UNIXProcess.<init>([B[BI[BI[B[IZ)V [compiled] +29 (line 271) fad50e0c * *java/lang/ProcessImpl.start([Ljava/lang/String;Ljava/util/Map;Ljava/lang/String;[Ljava/lang/ProcessBuilder$Redirect;Z)Ljava/lang/Process;+434 (line 207)
With Dan McDonald's help, I captured the act when it happened (forking while pkcs11/libcryptoutil holds urandom_mutex) using the attached D script, which showed me this:
CPU ID FUNCTION:NAME 11 12265 forksys:entry Holy shit, busted! 18 10243 mutex_unlock_queue:mutex-release libc.so.1`mutex_unlock_queue+0x2f libc.so.1`mutex_unlock+0x11b libcryptoutil.so.1`pkcs11_read_common+0x3b libcryptoutil.so.1`pkcs11_read_urandom+0x2c libcryptoutil.so.1`pkcs11_get_urandom+0x26 pkcs11_softtoken.so.1`C_GenerateRandom+0xb9 libpkcs11.so.1`C_GenerateRandom+0x70 libj2pkcs11.so`Java_sun_security_pkcs11_wrapper_PKCS11_C_1GenerateRandom+0x76 0xfa60b158 0xfa6035fb 0xfab8f350
The D script takes two arguments:
./fork-while-mutex.d <pid> 0x<mutex_addr>
where "mutex_addr" is the hex addr of the mutex in question, which I obtained with:
echo urandom_mutex=J | mdb -p <pid>
Files
Related issues
Actions