Bug #13982


excessive threads slows down find_elf

Added by Robert Mustacchi about 1 year ago. Updated about 1 year ago.

tools - gate/build tools

I was recently on a system and found that find_elf was taking quite a long time, on the order of 10-15 minutes, on a system with 256 threads. I was able to reproduce this and found that setting the number of threads to DMAKE_MAX_JOBS resulted in sub-optimal times in both cases, even on a system that had a single socket and faster, less prototype CPUs. Note that because a lot of the testing here was done over and over on the same workspace, most of the relevant data was cached in the ARC, and what wasn't was on either a SATA or NVMe device depending on the system. However, the microstate accounting suggests that we were not really bound by I/O.

For example, here's the ptime -m output on the larger system:

real    11:38.453445556
user     1:16.246767121
sys     19:35.662120867
trap        0.027121963
tflt        0.000067207
dflt        0.041581569
kflt        0.000000000
lock  5:55:42.802274076
slp      2:31.110492619
lat        47.070167751
stop 43:19:01.297988822

I did a series of experiments on this system and on a less prototype system (which had DMAKE_MAX_JOBS set to 34) and found that in all cases limiting the number of threads actually improved things. In comparison to the output above, we see roughly 10x better performance with 8 threads:

real     1:06.239679833
user     1:11.588261203
sys      3:48.228552512
trap        0.027870508
tflt        0.007836830
dflt        0.169006699
kflt        0.001164796
lock     1:17.338284433
slp      2:11.645642483
lat         3.808370936
stop     5:32.783047233

The reason there's such a dramatic difference in stop time here is that fork1() must still get all of the process's threads serialized. When looking at the fork rate with DTrace (e.g. instrumenting cfork with a predicate), in the large-thread case I saw a consistent 12 forks per second. Note that there is other work going on with each fork; however, when set to 8 jobs, the rate was between 95 and 120 forks per second on the prototype, lower-frequency system. This makes sense, as there are far fewer threads to coordinate and stop, reducing the overall cost even though we're not actually duplicating them all via forkall().
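For reference, the cfork instrumentation described above can be reconstructed as a short D script along these lines. This is a sketch, not the script actually used: the original report doesn't record the exact predicate, so the execname filter here is an assumption.

```d
/*
 * Count forks per second system-wide via the kernel's common fork
 * path. Hypothetical reconstruction; the predicate is an assumption.
 */
fbt::cfork:entry
/execname == "make"/
{
        @forks = count();
}

tick-1sec
{
        printa("forks/sec: %@d\n", @forks);
        clear(@forks);
}
```

Watching this counter while varying the job count makes the serialization cost visible directly: the per-second fork rate collapses as the thread count per forking process grows.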

For context, here is the approximate ptime -m output for several different thread-count values:

1 thread:

real     2:57.265522813
user       47.469055181
sys      1:57.157531583
trap        0.005804202
tflt        0.000000000
dflt        0.000173237
kflt        0.000000000
lock        0.013249417
slp      5:35.036242456
lat         1.343701798
stop       24.036933125

4 threads:

real     1:09.665161596
user       58.943810439
sys      3:02.334504686
trap        0.019857111
tflt        0.001991922
dflt        0.045399272
kflt        0.000873948
lock        1.524442200
slp      3:09.884952247
lat         2.037143747
stop     2:23.712300256

16 threads:

real     1:33.188876060
user     1:15.710282746
sys      4:25.749822644
trap        0.029206395
tflt        0.003432280
dflt        0.140559800
kflt        0.000824983
lock     5:54.414623916
slp      2:06.719068022
lat         5.540346113
stop    16:56.440859479

32 threads:

real     2:15.436185160
user     1:16.567693330
sys      5:24.169648195
trap        0.035824669
tflt        0.000909151
dflt        0.082744993
kflt        0.000640469
lock    16:05.650229058
slp      1:57.733760604
lat         8.507864634
stop    54:01.475709686

Based on all of this, I believe we should take the prudent approach and cap find_elf at a maximum of 8 threads for the time being.

Related issues

Related to illumos gate - Feature #13248: parallelise the quest for elves (Closed, assigned to Andy Fiddaman)
