Bug #349

hang during network boot (circular kcf dependency)

Added by Garrett D'Amore almost 9 years ago. Updated almost 9 years ago.

Status: Resolved
Priority: High
Category: kernel
Start date: 2010-10-15
Due date:
% Done: 100%
Estimated time:
Difficulty:
Tags:

Description

Bryan Cantrill reported an interesting deadlock/hang in kcf during netboot:

Has anyone been able to get illumos to net boot? We're seeing a hang
on boot (after/while plumbing the dld streams module), and it would be
very helpful to know if someone has gotten this to work (or is trying
to, for that matter)...

- Bryan

Just to follow up on this: as it turns out, this is an odd bug due to
a circular dependency between the kcf and swrand drivers. When we
boot over the network (or more accurately, from a ramdisk that has
itself been loaded by pxegrub or gPXE), we load kcf as one of dev/ip's
dependencies; kcf's _init, in turn, calls kcf_rnd_schedule_timeout(),
which calls crypto_mech2id_common(), which does an explicit modload()
of "crypto/swrand". The problem is that crypto/swrand has a
dependency on kcf -- which is busy (recall that we got here by trying
to load it), and this code path doesn't detect circular dependencies.
When one does not boot over the network, we load zfs before kcf or
swrand -- and zfs has an explicit dependency on swrand which then
pulls in kcf. Of course, this code path still calls
crypto_mech2id_common(), but because this loads a module by name, it
detects the circular dependency and kicks out. This bug should be
fixed (there are several ways to fix it), but we worked around it by
adding a "forceload: crypto/swrand" directive to /etc/system.

- Bryan
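The dependency loop described above can be illustrated with a small user-space model. This is only a sketch, not the illumos module loader: `load_module`, `find_module`, the `needs` list, and the two demo tables are hypothetical names invented for illustration. The model treats kcf's `_init`-time load of swrand as just another edge in the graph and shows what a loader that does detect the cycle would report. In the real hang the netboot path had no such check.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_NEEDS 4

enum mod_state { MOD_UNLOADED, MOD_LOADING, MOD_LOADED };

/* Hypothetical module record: a name plus the modules it pulls in. */
struct module {
    const char *name;
    const char *needs[MAX_NEEDS];   /* NULL-terminated list of names */
    enum mod_state state;
};

/* Linear search of the table by name; returns NULL if not found. */
struct module *
find_module(struct module *tab, size_t n, const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(tab[i].name, name) == 0)
            return (&tab[i]);
    }
    return (NULL);
}

/*
 * Load a module and everything it needs.  Returns 0 on success and -1
 * if the module is unknown or is already on the current load path
 * (a circular dependency).
 */
int
load_module(struct module *tab, size_t n, const char *name)
{
    struct module *m = find_module(tab, n, name);

    if (m == NULL)
        return (-1);
    if (m->state == MOD_LOADED)
        return (0);
    if (m->state == MOD_LOADING)
        return (-1);    /* "busy": we got here by trying to load it */

    m->state = MOD_LOADING;
    for (size_t i = 0; i < MAX_NEEDS && m->needs[i] != NULL; i++) {
        if (load_module(tab, n, m->needs[i]) != 0) {
            m->state = MOD_UNLOADED;
            return (-1);
        }
    }
    m->state = MOD_LOADED;
    return (0);
}

/* kcf's init path loads swrand, and swrand depends on kcf. */
int
demo_circular(void)
{
    struct module tab[] = {
        { "kcf",    { "swrand", NULL }, MOD_UNLOADED },
        { "swrand", { "kcf", NULL },    MOD_UNLOADED },
    };
    return (load_module(tab, 2, "kcf"));
}

/* Same modules, but kcf no longer loads swrand from its init path. */
int
demo_no_init_load(void)
{
    struct module tab[] = {
        { "kcf",    { NULL },        MOD_UNLOADED },
        { "swrand", { "kcf", NULL }, MOD_UNLOADED },
    };
    if (load_module(tab, 2, "kcf") != 0)
        return (-1);
    return (load_module(tab, 2, "swrand"));
}
```

In this model `demo_circular()` returns -1 because kcf is still marked as loading when swrand asks for it, while `demo_no_init_load()` returns 0 once the init-time edge from kcf to swrand is removed.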

History

#1

Updated by Garrett D'Amore almost 9 years ago

  • % Done changed from 0 to 80

The bug is a regression due to my integration of the fix for "6 Need open kcfd".

The fix is straightforward: we just need to move the lookup of the RNG mechanism ID out of the context of the thread that is starting up. This is precisely what taskqs are for, and system_taskq exists for exactly these single-shot jobs.

The fix looks like this:

garrett@thinkpad{20}> hg diff
diff -r f6152e8361fb usr/src/uts/common/crypto/api/kcf_random.c
--- a/usr/src/uts/common/crypto/api/kcf_random.c    Fri Oct 15 11:23:37 2010 -0700
+++ b/usr/src/uts/common/crypto/api/kcf_random.c    Fri Oct 15 14:39:54 2010 -0700
@@ -834,13 +834,22 @@
     }
 }

+static void
+rnd_mechid(void *notused)
+{
+    _NOTE(ARGUNUSED(notused));
+    rngmech_type = crypto_mech2id(SUN_RANDOM);
+}
+
 void
 kcf_rnd_schedule_timeout(boolean_t do_mech2id)
 {
     clock_t ut;    /* time in microseconds */

-    if (do_mech2id)
-        rngmech_type = crypto_mech2id(SUN_RANDOM);
+    if (do_mech2id) {
+        /* This should never fail due to TQ_SLEEP. */
+        (void) taskq_dispatch(system_taskq, rnd_mechid, NULL, TQ_SLEEP);
+    }

     /*
      * The new timeout value is taken from the buffer of random bytes.
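The idea behind the diff above, moving the lookup off the initializing thread, can be modeled in user space with a plain deferred-work queue. This is a single-threaded stand-in, not the kernel taskq API: `wq_dispatch`, `wq_run_pending`, `model_init`, the `module_busy` flag, and the mechanism ID value 42 are all assumptions made for illustration. The real fix uses `taskq_dispatch()` on `system_taskq`, which runs the job on a separate thread.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAP 8

typedef void (*task_func_t)(void *);

struct task {
    task_func_t func;
    void *arg;
};

/* Hypothetical fixed-size FIFO of pending jobs. */
struct work_queue {
    struct task tasks[QUEUE_CAP];
    size_t head;    /* index of the oldest pending job */
    size_t count;   /* number of pending jobs */
};

/* Queue a job for later; returns 0 on success, -1 if the queue is full. */
int
wq_dispatch(struct work_queue *q, task_func_t func, void *arg)
{
    assert(q != NULL && func != NULL);
    if (q->count == QUEUE_CAP)
        return (-1);
    size_t slot = (q->head + q->count) % QUEUE_CAP;
    q->tasks[slot].func = func;
    q->tasks[slot].arg = arg;
    q->count++;
    return (0);
}

/* Run every pending job in FIFO order; returns how many ran. */
size_t
wq_run_pending(struct work_queue *q)
{
    size_t ran = 0;

    assert(q != NULL);
    while (q->count > 0) {
        struct task t = q->tasks[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
        t.func(t.arg);
        ran++;
    }
    return (ran);
}

static int module_busy;     /* nonzero while model_init() is running */
static int mech_id = -1;    /* -1 means "not looked up yet" */

/* Stand-in for the lookup: it only succeeds once init has finished. */
static void
lookup_mech(void *notused)
{
    (void) notused;
    mech_id = module_busy ? -1 : 42;
}

/* Model of an init routine that defers the lookup instead of calling it. */
int
model_init(struct work_queue *q)
{
    int rc;

    module_busy = 1;
    rc = wq_dispatch(q, lookup_mech, NULL);
    module_busy = 0;
    return (rc);
}

int
model_mech_id(void)
{
    return (mech_id);
}
```

After `model_init()` returns, the mechanism ID is still unset. It is filled in only when `wq_run_pending()` later runs the queued job, by which point the module is no longer busy.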

#2

Updated by Garrett D'Amore almost 9 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 80 to 100

I went ahead and integrated this into cset fed843d791e9.
