excessive threads slow down find_elf
I was recently on a system with 256 threads and found that find_elf was taking quite a long time, on the order of 10-15 minutes. I was able to reproduce this and found that the DMAKE_MAX_JOBS setting resulted in sub-optimal run times in both cases, even on a system with a single socket and faster, less prototype CPUs. Note that because a lot of the testing here was repeated on the same workspace, most of the data and anything related to it was cached in the ARC, and what wasn't was on either a SATA or NVMe device depending on the system. However, the microstate time suggests that we were not really
For example, here's the ptime -m output on the larger system:
real       11:38.453445556
user        1:16.246767121
sys        19:35.662120867
trap           0.027121963
tflt           0.000067207
dflt           0.041581569
kflt           0.000000000
lock     5:55:42.802274076
slp         2:31.110492619
lat           47.070167751
stop    43:19:01.297988822
I did a series of experiments on this system and on a less prototype system (which had DMAKE_MAX_JOBS set to 34) and found that in all cases limiting the number of threads improved things. In comparison, we see about 10x faster performance with 8 threads:
real    1:06.239679833
user    1:11.588261203
sys     3:48.228552512
trap       0.027870508
tflt       0.007836830
dflt       0.169006699
kflt       0.001164796
lock    1:17.338284433
slp     2:11.645642483
lat        3.808370936
stop    5:32.783047233
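As a quick sanity check on the "about 10x" figure, here is a small Python sketch that converts the real and stop values from the two runs above into seconds and compares them. The helper name ptime_seconds is made up for this illustration and is not part of ptime or find_elf; the input strings are copied from the output above.

```python
def ptime_seconds(value: str) -> float:
    """Convert a ptime-style duration ([[h:]m:]s.nnnnnnnnn) to seconds."""
    seconds = 0.0
    for part in value.split(":"):
        # Each colon-separated field is worth 60x the one to its right.
        seconds = seconds * 60 + float(part)
    return seconds


# Values copied from the two ptime -m runs above.
large = {"real": "11:38.453445556", "stop": "43:19:01.297988822"}
eight = {"real": "1:06.239679833", "stop": "5:32.783047233"}

for field in ("real", "stop"):
    ratio = ptime_seconds(large[field]) / ptime_seconds(eight[field])
    print(f"{field}: {ratio:.1f}x")
```

The wall-clock ratio works out to roughly 10.5x, while the stop time differs by a much larger factor (several hundred times).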
The dramatic difference in stop time is because fork1() must still get all threads serialized. When looking at the fork rate with DTrace (e.g. instrumenting cfork with a predicate), I saw a consistent 12 forks per second in the large thread case. Note that there is other work going on with each fork; however, when set to 8 jobs, the rate was between 95 and 120 forks per second on this prototype, lower-frequency system. This makes sense, as there are far fewer threads to coordinate and stop, which reduces the overall cost even though we're not actually duplicating them all via forkall().
For context, here is the approximate ptime -m output for several other chosen values, one column per run:
real   2:57.265522813   1:09.665161596   1:33.188876060   2:15.436185160
user     47.469055181     58.943810439   1:15.710282746   1:16.567693330
sys    1:57.157531583   3:02.334504686   4:25.749822644   5:24.169648195
trap      0.005804202      0.019857111      0.029206395      0.035824669
tflt      0.000000000      0.001991922      0.003432280      0.000909151
dflt      0.000173237      0.045399272      0.140559800      0.082744993
kflt      0.000000000      0.000873948      0.000824983      0.000640469
lock      0.013249417      1.524442200   5:54.414623916  16:05.650229058
slp    5:35.036242456   3:09.884952247   2:06.719068022   1:57.733760604
lat       1.343701798      2.037143747      5.540346113      8.507864634
stop     24.036933125   2:23.712300256  16:56.440859479  54:01.475709686
Based on all this, I believe the prudent approach is to cap find_elf at a maximum of 8 threads for the time being.
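A minimal sketch of what such a cap could look like, written in Python purely for illustration. This is not find_elf's actual job-selection code, which isn't shown here. The names worker_count and MAX_THREADS and the fallback of 1 are hypothetical; the only thing taken from the above is that the thread count follows DMAKE_MAX_JOBS and that 8 is the proposed ceiling.

```python
import os

MAX_THREADS = 8  # proposed ceiling from the measurements above


def worker_count(env=os.environ, default=1):
    """Honor DMAKE_MAX_JOBS, but never return more than MAX_THREADS.

    `default` is a hypothetical fallback used when the variable is
    unset or not a valid integer.
    """
    try:
        requested = int(env.get("DMAKE_MAX_JOBS", default))
    except ValueError:
        requested = default
    # Clamp into the range [1, MAX_THREADS].
    return max(1, min(requested, MAX_THREADS))


if __name__ == "__main__":
    print(worker_count())
```

With this shape, a setting of 256 or 34 would both be reduced to 8, while a smaller explicit setting is still respected.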