mp_startup_common races itself

Review Request #1173 — Created Aug. 21, 2018 and submitted

jbk
illumos-gate
9760
general

Fixes MP startup so that we don't start up a core until we've finished starting up the previous one.
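
The ordering described above can be pictured with a small user-space sketch. This is not the illumos change itself: `start_cpus_serially` and `init_cpu` are hypothetical names, Python threads stand in for cores, and the only point illustrated is that the boot thread waits for each core to report completion before starting the next.

```python
import threading


def start_cpus_serially(cpu_ids, init_cpu, timeout=5.0):
    """Start each simulated CPU only after the previous one reports ready.

    cpu_ids  -- iterable of CPU ids to bring up, in order
    init_cpu -- hypothetical per-CPU initialization callback
    Returns the list of CPU ids in the order they came online.
    """
    online = []
    for cpu_id in cpu_ids:
        ready = threading.Event()

        def worker(cpu_id=cpu_id, ready=ready):
            init_cpu(cpu_id)        # simulated per-CPU initialization
            online.append(cpu_id)   # record the order CPUs came online
            ready.set()             # tell the boot thread we are done

        thread = threading.Thread(target=worker)
        thread.start()
        # The boot thread blocks here, so the next CPU is never started
        # while this one is still initializing.
        if not ready.wait(timeout):
            raise TimeoutError(f"cpu{cpu_id} did not finish starting")
        thread.join()
    return online
```

For example, `start_cpus_serially([1, 2, 3], lambda cpu_id: None)` brings the simulated CPUs up strictly one at a time and returns them in start order.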

We've tested this on the hardware used at Joyent, as well as on my Broadwell-based personal server (which was really good at triggering the problem). Additionally, others have tested images containing the fix and have not reported any boot problems caused by it.

citrus
  1. Ship It!
yuripv
  1. Ship It!
tsoome
  1. Ship It!
    1. It does not fix network boot with more than 1 core:

        Nazgul.lan -> beastie      NFS C LOOKUP3 FH=C880 acpi_drv
           beastie -> Nazgul.lan   NFS R LOOKUP3 OK FH=6275
        Nazgul.lan -> beastie      NFS C LOOKUP3 FH=E698 cpu_ms.GenuineIntel.6.94.3
           beastie -> Nazgul.lan   NFS R LOOKUP3 No such file or directory
        Nazgul.lan -> beastie      NFS C ACCESS3 FH=6275 (read,modify,extend,execute)
           beastie -> Nazgul.lan   NFS R ACCESS3 OK (read,modify,extend,execute)
        Nazgul.lan -> beastie      NFS C LOOKUP3 FH=5FCA cpu
           beastie -> Nazgul.lan   NFS R LOOKUP3 No such file or directory
        Nazgul.lan -> beastie      NFS C GETATTR3 FH=6275
           beastie -> Nazgul.lan   NFS R GETATTR3 OK
        Nazgul.lan -> beastie      NFS C LOOKUP3 FH=6A4A cpu
           beastie -> Nazgul.lan   NFS R LOOKUP3 No such file or directory
        Nazgul.lan -> beastie      NFS C READ3 FH=6275 at 0 for 4096
           beastie -> Nazgul.lan   NFS R READ3 OK (4096 bytes)
        Nazgul.lan -> beastie      NFS C LOOKUP3 FH=FCC0 cpu_ms.GenuineIntel.6.94.3
           beastie -> Nazgul.lan   NFS R LOOKUP3 No such file or directory
        Nazgul.lan -> beastie      NFS C LOOKUP3 FH=FCC0 cpu_ms.GenuineIntel.6.94.3 (retransmit)
           beastie -> Nazgul.lan   NFS R LOOKUP3 No such file or directory
        Nazgul.lan -> beastie      NFS C READ3 FH=6275 at 0 for 4096 (retransmit)
           beastie -> Nazgul.lan   NFS R READ3 OK (4096 bytes)
      

      There, the firmware download never completes because the client host fails to receive the responses, but the fix is still a good thing :)

    2. This bug has existed for quite some time, but so far Broadwell chips (such as the one in my server) are the only ones we've seen that really trigger it easily (though from the analysis there's nothing that would restrict the bug to just that chip -- it's likely just a timing thing).

      Generally if you boot -v, the bug manifests itself as a hang while starting CPUs -- you'll see a few 'cpuXX: initialization complete - online' messages (amongst others) and then a hang. If you see that message for all of your CPUs (or don't see it at all), you haven't hit this bug.
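
The check described above can be sketched as a small log filter. This is an illustration only: the regular expression is built from the message as quoted in this comment (the exact console format, including whether a colon follows the CPU number, is an assumption), and `online_cpus` and `matches_hang_pattern` are hypothetical helper names. The caller supplies the CPU ids expected to print the message; the sketch cannot tell whether the system actually hung, only whether some but not all of those CPUs reported online.

```python
import re

# Assumed format, taken from the message quoted in the comment above.
# The colon is optional because the exact console text is not confirmed.
ONLINE_RE = re.compile(r"cpu(\d+):? initialization complete - online")


def online_cpus(boot_log):
    """Return the set of CPU ids that printed the 'online' message."""
    return {int(m.group(1)) for m in ONLINE_RE.finditer(boot_log)}


def matches_hang_pattern(boot_log, expected_cpu_ids):
    """True if some, but not all, expected CPUs reported online.

    Seeing the message for every CPU, or for none, does not match
    the symptom described for this bug.
    """
    expected = set(expected_cpu_ids)
    seen = online_cpus(boot_log) & expected
    return 0 < len(seen) < len(expected)
```

Given captured `boot -v` console output in `log`, `matches_hang_pattern(log, range(1, 8))` would flag a boot where only a few of CPUs 1 through 7 printed the message.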

jbk
Review request changed

Status: Closed (submitted)
