Bug #7912

nfs_rwlock readers are running wild waiting for a writer that cannot come

Added by Marcel Telka about 1 month ago. Updated 7 days ago.

Status: Closed
Start date: 2017-02-27
Priority: High
Due date:
Assignee: Marcel Telka
% Done: 100%
Category: nfs - NFS server and client
Target version: -
Difficulty: Medium
Tags: needs-triage

Description

We have 5 threads trying to acquire a nfs_rwlock as a reader:

> ffffff00872e8c40::findstack -v
stack pointer for thread ffffff00872e8c40: ffffff00872e8aa0
[ ffffff00872e8aa0 _resume_from_idle+0xf4() ]
  ffffff00872e8ad0 swtch+0x141()
  ffffff00872e8b10 cv_wait+0x70(ffffff130c2c0c90, ffffff130c2c0c88)
  ffffff00872e8b70 nfs_rw_enter_sig+0x1b6(ffffff130c2c0c78, 1, 0)
  ffffff00872e8c20 nfs4delegreturn_thread+0x159(ffffff130dc9b088)
  ffffff00872e8c30 thread_start+8()
> ffffff0085e36c40::findstack -v
stack pointer for thread ffffff0085e36c40: ffffff0085e36aa0
[ ffffff0085e36aa0 _resume_from_idle+0xf4() ]
  ffffff0085e36ad0 swtch+0x141()
  ffffff0085e36b10 cv_wait+0x70(ffffff130c2c0c90, ffffff130c2c0c88)
  ffffff0085e36b70 nfs_rw_enter_sig+0x1b6(ffffff130c2c0c78, 1, 0)
  ffffff0085e36c20 nfs4delegreturn_thread+0x159(ffffff131f2ff4a0)
  ffffff0085e36c30 thread_start+8()
> ffffff0095754c40::findstack -v
stack pointer for thread ffffff0095754c40: ffffff0095754aa0
[ ffffff0095754aa0 _resume_from_idle+0xf4() ]
  ffffff0095754ad0 swtch+0x141()
  ffffff0095754b10 cv_wait+0x70(ffffff130c2c0c90, ffffff130c2c0c88)
  ffffff0095754b70 nfs_rw_enter_sig+0x1b6(ffffff130c2c0c78, 1, 0)
  ffffff0095754c20 nfs4delegreturn_thread+0x159(ffffff1311feab38)
  ffffff0095754c30 thread_start+8()
> ffffff00957fcc40::findstack -v
stack pointer for thread ffffff00957fcc40: ffffff00957fcaa0
[ ffffff00957fcaa0 _resume_from_idle+0xf4() ]
  ffffff00957fcad0 swtch+0x141()
  ffffff00957fcb10 cv_wait+0x70(ffffff130c2c0c90, ffffff130c2c0c88)
  ffffff00957fcb70 nfs_rw_enter_sig+0x1b6(ffffff130c2c0c78, 1, 0)
  ffffff00957fcc20 nfs4delegreturn_thread+0x159(ffffff1301f311e0)
  ffffff00957fcc30 thread_start+8()
> ffffff0083532c40::findstack -v
stack pointer for thread ffffff0083532c40: ffffff0083532aa0
[ ffffff0083532aa0 _resume_from_idle+0xf4() ]
  ffffff0083532ad0 swtch+0x141()
  ffffff0083532b10 cv_wait+0x70(ffffff130c2c0c90, ffffff130c2c0c88)
  ffffff0083532b70 nfs_rw_enter_sig+0x1b6(ffffff130c2c0c78, 1, 0)
  ffffff0083532c20 nfs4delegreturn_thread+0x159(ffffff195381be88)
  ffffff0083532c30 thread_start+8()
>

and one more thread trying to acquire the same lock as a writer:

> ffffff12ebc2f080::findstack -v
stack pointer for thread ffffff12ebc2f080: ffffff009e6acc40
[ ffffff009e6acc40 _resume_from_idle+0xf4() ]
  ffffff009e6acc70 swtch+0x141()
  ffffff009e6accb0 cv_wait+0x70(ffffff130c2c0c90, ffffff130c2c0c88)
  ffffff009e6acd10 nfs_rw_enter_sig+0xd0(ffffff130c2c0c78, 0, 0)
  ffffff009e6acd60 nfs4_rwlock+0x52(ffffff1320d95880, 1, 0)
  ffffff009e6acdc0 fop_rwlock+0x2d(ffffff1320d95880, 1, 0)
  ffffff009e6ace90 write+0x1ae(f, 865ac30, 800)
  ffffff009e6acec0 write32+0x1e(f, 865ac30, 800)
  ffffff009e6acf10 _sys_sysenter_post_swapgs+0x149()
>

The reader threads are spinning, while the writer thread is sleeping (output from THREADPTR ::thread -mdi commands):

            ADDR    STATE  FLG PFLG SFLG   PRI  EPRI PIL             INTR         DISPTIME BOUND PR SWITCH
ffffff00872e8c40 sleep       8 2000    3    60     0   0              n/a           b0a181    -1  0 t-0   
ffffff0085e36c40 sleep       8 2000    3    60     0   0              n/a           b0a52a    -1  0 t-0   
ffffff0095754c40 sleep       8 2000    3    60     0   0              n/a           b0aa32    -1  0 t-0   
ffffff00957fcc40 sleep       8 2000    3    60     0   0              n/a           b0adf5    -1  0 t-0   
ffffff0083532c40 onproc      8 2000   13    60     0   0              n/a           b0b1b4    -1  0 t-0   
ffffff12ebc2f080 sleep    1000  104    3    59     0   0              n/a           130dcd    -1  0 t-10332220

Related issues

Related to illumos gate - Bug #7601: nfs_rwlock_t does not scale with cv_broadcast() Closed 2016-11-21

History

#1 Updated by Marcel Telka about 1 month ago

  • Related to Bug #7601: nfs_rwlock_t does not scale with cv_broadcast() added

#2 Updated by Marcel Telka about 1 month ago

Root cause

The fix for #7601 assumed that cv_wait(9f)/cv_wait_sig(9f) places the calling thread at the end of the wait queue and that cv_signal(9f) always wakes up the first waiting thread in the queue. The implementation is not that straightforward: thread priorities are considered too. Threads with higher priority are queued before lower-priority threads even if they arrived after the lower-priority threads.

In the example above (in the Description) the waiting writer has priority 59, while all readers have priority 60. Once a reader is woken up, it sees that a writer is waiting, so it calls cv_signal(9f) to wake up the next thread and goes back to sleep by calling cv_wait(9f). The woken reader goes back to sleep because the nfs_rwlock implementation prefers writers over readers, and because of its higher priority it is queued before the writer. The next thread woken up is again a reader with priority 60, and it does the same: it finds a writer sleeping, calls cv_signal(9f), and goes back to sleep by calling cv_wait(9f), where it is again queued before the waiting writer. This repeats indefinitely, so the writer never reaches the head of the queue.
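The cycle described above can be reproduced with a small user-space toy model. This is only a sketch, not the kernel code: the names `waiter_t`, `sleepq_t`, `sq_insert`, `sq_pop` and `wakeups_until_writer` are hypothetical, and the queue discipline (higher priority first, FIFO among equal priorities) is an assumption based on the description in this comment.

```c
#include <assert.h>

#define MAXQ 16

/* One sleeping thread in the toy wait queue. */
typedef struct waiter {
    int pri;        /* priority; a larger value is queued earlier */
    int is_writer;  /* 1 for the writer, 0 for a reader */
} waiter_t;

typedef struct sleepq {
    waiter_t w[MAXQ];
    int n;
} sleepq_t;

/* Models cv_wait(): queue ahead of every strictly lower-priority waiter. */
static void
sq_insert(sleepq_t *q, waiter_t t)
{
    int i = 0, j;

    assert(q->n < MAXQ);
    while (i < q->n && q->w[i].pri >= t.pri)
        i++;
    for (j = q->n; j > i; j--)
        q->w[j] = q->w[j - 1];
    q->w[i] = t;
    q->n++;
}

/* Models cv_signal(): remove the head of the queue; returns 0 if empty. */
static int
sq_pop(sleepq_t *q, waiter_t *out)
{
    int j;

    if (q->n == 0)
        return 0;
    *out = q->w[0];
    for (j = 1; j < q->n; j++)
        q->w[j - 1] = q->w[j];
    q->n--;
    return 1;
}

/*
 * Returns how many reader wakeups happen before the writer is woken,
 * or -1 if the writer is still asleep after max_wakeups wakeups.
 */
int
wakeups_until_writer(int nreaders, int reader_pri, int writer_pri,
    int max_wakeups)
{
    sleepq_t q;
    waiter_t cur, next, t;
    int i, wakeups;

    q.n = 0;
    t.pri = reader_pri;
    t.is_writer = 0;
    for (i = 0; i < nreaders; i++)
        sq_insert(&q, t);
    t.pri = writer_pri;
    t.is_writer = 1;
    sq_insert(&q, t);

    /* The thread releasing the lock signals the head of the queue. */
    if (!sq_pop(&q, &cur))
        return -1;
    for (wakeups = 0; wakeups < max_wakeups; wakeups++) {
        if (cur.is_writer)
            return wakeups;
        /* A reader sees the waiting writer: signal, then wait again. */
        if (!sq_pop(&q, &next))
            return -1;
        sq_insert(&q, cur);
        cur = next;
    }
    return -1;
}
```

In this model, `wakeups_until_writer(5, 60, 59, 1000)` mirrors the priorities from the Description and never reaches the writer, while giving all six threads the same priority lets the writer be woken after the five readers have each been woken once.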

#3 Updated by Gordon Ross 30 days ago

Ah, so we have a classic "priority inversion" problem?

The usual solution (taught in many CS courses on the subject) is to temporarily elevate the priority of the resource holder to the same priority as the highest priority blocked thread. Can we do that here?

#4 Updated by Marcel Telka 30 days ago

Gordon Ross wrote:

Ah, so we have a classic "priority inversion" problem?

Partially. The other part of the problem is that nfs_rwlock prefers writers over readers.

The usual solution (taught in many CS courses on the subject) is to temporarily elevate the priority of the resource holder to the same priority as the highest priority blocked thread. Can we do that here?

Maybe, but I'm not sure an easy implementation of that is possible. I'm working on something simple, tailored specifically to the needs of nfs_rwlock.

#5 Updated by Marcel Telka 29 days ago

Fix description

The fix introduces a new, separate condition variable for waiting readers. When waiters need to be woken up, either one waiting writer is signalled (via the cv condition variable) or all waiting readers are woken by a broadcast (via the newly introduced cv_rd condition variable).

Review: https://www.illumos.org/rb/r/385/
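The idea in the fix description can be illustrated with a minimal user-space sketch built on POSIX threads. It is not the committed kernel code: only the condition variable names cv and cv_rd come from the description, while the `sketch_` type and function names and the `lock`, `count` and `waiters` fields are assumptions made for illustration.

```c
#include <assert.h>
#include <pthread.h>

/* Writer-preferring rwlock with separate wait channels per waiter kind. */
typedef struct sketch_rwlock {
    pthread_mutex_t lock;
    pthread_cond_t cv;     /* waiting writers sleep here */
    pthread_cond_t cv_rd;  /* waiting readers sleep here */
    int count;             /* > 0: active readers, -1: active writer */
    int waiters;           /* number of writers currently blocked */
} sketch_rwlock_t;

void
sketch_rw_init(sketch_rwlock_t *l)
{
    pthread_mutex_init(&l->lock, NULL);
    pthread_cond_init(&l->cv, NULL);
    pthread_cond_init(&l->cv_rd, NULL);
    l->count = 0;
    l->waiters = 0;
}

void
sketch_rw_enter(sketch_rwlock_t *l, int writer)
{
    pthread_mutex_lock(&l->lock);
    if (!writer) {
        /* Readers yield to an active writer and to waiting writers. */
        while (l->count < 0 || l->waiters > 0)
            pthread_cond_wait(&l->cv_rd, &l->lock);
        l->count++;
    } else {
        while (l->count != 0) {
            l->waiters++;
            pthread_cond_wait(&l->cv, &l->lock);
            l->waiters--;
        }
        l->count = -1;
    }
    pthread_mutex_unlock(&l->lock);
}

void
sketch_rw_exit(sketch_rwlock_t *l)
{
    pthread_mutex_lock(&l->lock);
    assert(l->count != 0);
    if (l->count < 0)
        l->count = 0;   /* the writer leaves */
    else
        l->count--;     /* one reader leaves */
    if (l->count == 0) {
        /* Wake exactly one writer, or else every waiting reader. */
        if (l->waiters > 0)
            pthread_cond_signal(&l->cv);
        else
            pthread_cond_broadcast(&l->cv_rd);
    }
    pthread_mutex_unlock(&l->lock);
}
```

Because readers and writers no longer share one wait queue, a woken reader never has to pass the wakeup on and requeue itself, so the relative priorities of readers and writers cannot keep the writer from being signalled.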

#6 Updated by Marcel Telka 23 days ago

  • Status changed from In Progress to Pending RTI

#7 Updated by Electric Monk 7 days ago

  • % Done changed from 0 to 100
  • Status changed from Pending RTI to Closed

git commit 7909625fdb7ecb20e9b7a777cfc0ec7ee63b4642

commit  7909625fdb7ecb20e9b7a777cfc0ec7ee63b4642
Author: Marcel Telka <marcel@telka.sk>
Date:   2017-03-22T21:08:11.000Z

    7912 nfs_rwlock readers are running wild waiting for a writer that cannot come
    Reviewed by: Arne Jansen <arne@die-jansens.de>
    Reviewed by: Gordon Ross <Gordon.W.Ross@gmail.com>
    Approved by: Robert Mustacchi <rm@joyent.com>
