Bug #13982

closed

excessive threads slows down find_elf

Added by Robert Mustacchi 3 months ago. Updated 3 months ago.

Status:
Closed
Priority:
Normal
Category:
tools - gate/build tools
Start date:
Due date:
% Done:
100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

I was recently on a system with 256 threads and found that find_elf was taking quite a long time, on the order of 10-15 minutes. I was able to reproduce this and found that the DMAKE_MAX_JOBS setting resulted in sub-optimal run times in both cases, even on a system with a single socket and faster, less prototype CPUs. Note that because much of the testing was repeated on the same workspace, most of the relevant data was cached in the ARC, and what wasn't was on either a SATA or NVMe device, depending on the system. However, the microstate time suggests that we were not really

For example, here's the ptime -m output on the larger system:

real    11:38.453445556
user     1:16.246767121
sys     19:35.662120867
trap        0.027121963
tflt        0.000067207
dflt        0.041581569
kflt        0.000000000
lock  5:55:42.802274076
slp      2:31.110492619
lat        47.070167751
stop 43:19:01.297988822

I did a series of experiments on this system and on a less prototype system (which had DMAKE_MAX_JOBS set to 34) and found that in all cases limiting the number of threads improved things. In comparison, we see about 10x faster performance with 8 threads:

real     1:06.239679833
user     1:11.588261203
sys      3:48.228552512
trap        0.027870508
tflt        0.007836830
dflt        0.169006699
kflt        0.001164796
lock     1:17.338284433
slp      2:11.645642483
lat         3.808370936
stop     5:32.783047233
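
As a quick check of the "about 10x" figure, the following sketch converts the real and stop fields from the two ptime -m outputs above into seconds and compares them. The helper name ptime_seconds and the dictionary names are hypothetical; only the numbers come from this report.

```python
def ptime_seconds(field: str) -> float:
    """Convert a ptime -m duration ([[h:]m:]s.fraction) to seconds."""
    total = 0.0
    for part in field.split(":"):
        # Each colon-separated part is one base-60 "digit".
        total = total * 60 + float(part)
    return total


# Values copied from the two ptime -m outputs above.
first_run = {"real": "11:38.453445556", "stop": "43:19:01.297988822"}
eight_threads = {"real": "1:06.239679833", "stop": "5:32.783047233"}

real_speedup = ptime_seconds(first_run["real"]) / ptime_seconds(eight_threads["real"])
stop_ratio = ptime_seconds(first_run["stop"]) / ptime_seconds(eight_threads["stop"])

print(f"real: {real_speedup:.1f}x faster, stop: {stop_ratio:.0f}x less")
```

The stop time shrinks far more than the wall-clock time, which fits the fork1() explanation that follows.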

The dramatic difference in stop time arises because fork1() must still stop and serialize all of the process's threads. When looking at the fork rate with DTrace (e.g. instrumenting cfork with a predicate), I saw a consistent 12 forks per second in the large thread case. Note that there is other work going on with each fork; however, when set to 8 jobs, the rate was between 95 and 120 forks per second on this prototype, lower-frequency system. This makes sense, as there are far fewer threads to coordinate and stop, which reduces the overall cost even though we're not actually duplicating them all as forkall() would.

For context, here is the approximate ptime -m output from several different chosen values:

1 thread:

real     2:57.265522813
user       47.469055181
sys      1:57.157531583
trap        0.005804202
tflt        0.000000000
dflt        0.000173237
kflt        0.000000000
lock        0.013249417
slp      5:35.036242456
lat         1.343701798
stop       24.036933125

4 threads:

real     1:09.665161596
user       58.943810439
sys      3:02.334504686
trap        0.019857111
tflt        0.001991922
dflt        0.045399272
kflt        0.000873948
lock        1.524442200
slp      3:09.884952247
lat         2.037143747
stop     2:23.712300256

16 threads:

real     1:33.188876060
user     1:15.710282746
sys      4:25.749822644
trap        0.029206395
tflt        0.003432280
dflt        0.140559800
kflt        0.000824983
lock     5:54.414623916
slp      2:06.719068022
lat         5.540346113
stop    16:56.440859479

32 threads:

real     2:15.436185160
user     1:16.567693330
sys      5:24.169648195
trap        0.035824669
tflt        0.000909151
dflt        0.082744993
kflt        0.000640469
lock    16:05.650229058
slp      1:57.733760604
lat         8.507864634
stop    54:01.475709686
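
To make the trend easier to see, here is a small sketch that tabulates the real (wall-clock) times from the runs above and picks the fastest thread count. The values are rounded from the ptime -m outputs in this report, and it assumes that all five runs are comparable (same system and workspace), which the report implies but does not state outright.

```python
# Wall-clock ("real") seconds per thread count, rounded from the outputs above.
real_seconds = {
    1: 2 * 60 + 57.27,
    4: 1 * 60 + 9.67,
    8: 1 * 60 + 6.24,
    16: 1 * 60 + 33.19,
    32: 2 * 60 + 15.44,
}

# Thread count with the lowest wall-clock time.
best = min(real_seconds, key=real_seconds.get)

for threads, secs in sorted(real_seconds.items()):
    ratio = secs / real_seconds[best]
    print(f"{threads:>2} threads: {secs:7.2f}s ({ratio:.2f}x the best time)")
```

Among the measured values, 4 and 8 threads are close, and the run time climbs again from 16 threads upward.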

Based on all this, I believe we should take the prudent approach and cap find_elf at a maximum of 8 threads for the time being.
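
A minimal sketch of what such a cap could look like is shown below. It is not the actual find_elf change: the names MAX_THREADS and worker_count are hypothetical, and the fallback to the CPU count when DMAKE_MAX_JOBS is unset or invalid is an assumption made only for illustration.

```python
import os

MAX_THREADS = 8  # cap proposed in this issue; the constant name is hypothetical


def worker_count(env=os.environ, ncpu=None):
    """Return a thread-pool size: the requested job count, limited to MAX_THREADS."""
    if ncpu is None:
        ncpu = os.cpu_count() or 1
    try:
        # Assumption: fall back to the CPU count if DMAKE_MAX_JOBS is unset.
        requested = int(env.get("DMAKE_MAX_JOBS", ncpu))
    except ValueError:
        # Assumption: treat an unparsable value the same as an unset one.
        requested = ncpu
    # Never go below 1 thread or above the cap.
    return max(1, min(requested, MAX_THREADS))


print(worker_count({"DMAKE_MAX_JOBS": "256"}))
```

With this shape, a large DMAKE_MAX_JOBS still drives the rest of the build but no longer inflates the number of threads that each fork has to stop.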


Related issues

Related to illumos gate - Feature #13248: parallelise the quest for elves (Closed, Andy Fiddaman)

Actions #1

Updated by Robert Mustacchi 3 months ago

Actions #2

Updated by Robert Mustacchi 3 months ago

  • Subject changed from excesive threads slows down find_elf to excessive threads slows down find_elf
Actions #3

Updated by Robert Mustacchi 3 months ago

I tested this in an environment where DMAKE_MAX_JOBS is set higher than the thread cap, verifying with pflags that the size of the thread pool was limited correctly, both during the build and when running find_elf manually.

Actions #4

Updated by Electric Monk 3 months ago

  • Status changed from New to Closed
  • % Done changed from 80 to 100

git commit 1d806c5f41e9c01235a8d6889ebc483b3de3cf4c

commit  1d806c5f41e9c01235a8d6889ebc483b3de3cf4c
Author: Robert Mustacchi <rm@fingolfin.org>
Date:   2021-07-30T20:28:41.000Z

    13982 excessive threads slows down find_elf
    Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
    Reviewed by: Andy Fiddaman <andy@omnios.org>
    Reviewed by: Toomas Soome <tsoome@me.com>
    Approved by: Dan McDonald <danmcd@joyent.com>
