nfs_rwlock_t does not scale with cv_broadcast()
The nfs_rwlock_t implementation uses cv_broadcast(9F) to wake up the waiters. This is clearly visible in nfs_rw_exit().
In case there are several WRITERs waiting, all of them wake up, one thread grabs the lock, but all the others just go back and wait again in cv_wait()/cv_wait_sig() - see nfs_rw_enter_sig(). This can cause significant performance degradation in some scenarios. To fix this, the nfs_rwlock_t implementation should be changed to use cv_signal(9F) instead of cv_broadcast(9F).
Updated by Marcel Telka about 4 years ago
To easily and clearly demonstrate this issue we need a case where many threads might try to acquire one nfs_rwlock_t as a WRITER. Such a case is the nfs4_frlock() implementation, which does the following (simplified):
- call nfs_rw_enter_sig(RW_WRITER)
- send the NFSv4 LOCK operation to the server
- if the NFSv4 LOCK operation is denied: release the lock, wait a bit, and start over
If we fire many threads trying to lock the same file over NFSv4, we should see the problem. I implemented a test that tries to lock one file in a loop from 2000 processes for one hour.
The test runs in three phases. In phase 1, 2000 processes are forked and each opens a file (over NFSv4). Once phase 1 is complete, phase 2 begins. In phase 2 every process does the following:
1. if the test time (one hour) has passed, return
2. fcntl(file, F_SETLKW, F_WRLCK)
3. sleep(1 second)
4. fcntl(file, F_SETLKW, F_UNLCK)
5. sleep(1 second)
6. goto 1
Once the test time passes (one hour), phase 3 begins. Here all processes complete their lock/unlock cycle (see above), close the file, and exit(). The test counts the successful NFSv4 LOCK operations done by all processes.
I ran the test for the current nfs_rwlock_t implementation and for the new implementation using cv_signal(9F). The test was run on a (virtual) 4-core machine with 6 GB of RAM. The results are in the attached test.svg file. The timestamps in the picture are relative, with 0:00 set to the beginning of phase 2.
- With the cv_broadcast() implementation the load in phase 2 peaks at about 200 (!). After about 30 minutes the load drops and stabilizes at about 35; in phase 3 the load slowly decreases.
- With the cv_signal() implementation the load in phase 2 is almost zero (actually 0.02). In phase 3 the load rises a bit, to about 2.
- Both implementations ended with a similar "load hill" peaking at about 15-20. Interesting, but out of scope for this work.
- The overall test run takes about 7 hours.
- The overall run time for the cv_signal() implementation is about 30 minutes shorter. This is also good.
- The number of successful NFSv4 lock/unlock cycles was similar for both implementations.
Updated by Electric Monk almost 4 years ago
- Status changed from Pending RTI to Closed
- % Done changed from 0 to 100
commit d1054fda727523967470132ee7266d3d934b5124
Author: Marcel Telka <firstname.lastname@example.org>
Date: 2016-12-07T16:33:31.000Z

7601 nfs_rwlock_t does not scale with cv_broadcast()
Reviewed by: Simon Klinkert <email@example.com>
Reviewed by: Toomas Soome <firstname.lastname@example.org>
Approved by: Dan McDonald <email@example.com>
Updated by Marcel Telka almost 4 years ago
Q: If we have only readers (comment starting at line 4820), wouldn't it make sense to use cv_broadcast() there, so we can get readers to finish what they are doing and we are not serializing the reads, as cv_signal() would cause to happen?
A: Good question. A similar "problem" is at line 4790.
Actually, even with the cv_broadcast() call the waiting (and now woken up) readers are serialized, because they need to grab the mutex before they return from cv_wait_sig().
The problem with the cv_broadcast() call is that when we wake up a lot of readers, it is possible that one (new) writer grabs the mutex before some of the readers actually return from cv_wait_sig(). If that is the case, the writer takes precedence and causes all of the remaining readers to spin and call cv_wait_sig() again. I just tried to avoid that.
One might argue that several cv_signal() calls in a row are slower than one cv_broadcast() call, even with the risk that the threads woken up by cv_broadcast() sometimes just go back to cv_wait_sig() again. Yes, that is easily possible, but I do not know for sure; I did not do any measurement for that.
... the line 4826 should be executed in rare cases only, so it does not matter much whether there is a cv_broadcast() or a cv_signal() call there. The performance difference would be almost zero. Also, if we wake up more than one reader here (at line 4826), all of them will call cv_signal() at line 4790. So the situation would very likely be worse with cv_broadcast() at line 4826 than with cv_signal().
Line 4790 is a bit different story. If there were a cv_broadcast() call there, all of the woken up readers would call cv_broadcast() again (at line 4790). And this is definitely something we do not want to see.