Project

General

Profile

Bug #4198

Assertion failed in nlm_slock_wait()

Added by Alasdair Lumsden about 7 years ago. Updated about 7 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
nfs - NFS server and client
Start date:
2013-10-11
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage
Gerrit CR:

Description

^Mpanic[cpu23]/thread=ffffff19beb260a0:
assertion failed: nslp->nsl_state == NLM_SL_GRANTED, file: ../../common/klm/nlm_
impl.c, line: 1959

ffffff00bcf78750 fffffffffba66b88 ()
ffffff00bcf787b0 klmmod:nlm_slock_wait+86 ()
ffffff00bcf78900 klmmod:nlm_call_lock+2a0 ()
ffffff00bcf789d0 klmmod:nlm_frlock_setlk+1ca ()
ffffff00bcf78a80 klmmod:nlm_frlock+162 ()
ffffff00bcf78ae0 klmops:lm4_frlock+3e ()
ffffff00bcf78bc0 nfs:nfs3_frlock+3c0 ()
ffffff00bcf78c60 genunix:fop_frlock+65 ()
ffffff00bcf78ec0 genunix:fcntl+5bd ()
ffffff00bcf78f10 unix:brand_sys_sysenter+1c9 ()

syncing file systems...

System is running a very recent custom SmartOS build with a number of customer zones on it.

Can provide crash dump, although its from a production system with customer data so would need to send directly.

This is after just 3 days of uptime, so this is a serious bug.

#1

Updated by Marcel Telka about 7 years ago

  • Assignee set to Marcel Telka

We panicked here (line 1959):

1931    if (nslp->nsl_state == NLM_SL_BLOCKED) {
1932        cv_res = cv_timedwait_sig(&nslp->nsl_cond,
1933            &g->lock, timeo_ticks);
1934    }
1935
1936    /*
1937     * No matter why we wake up, if the lock was
1938     * cancelled, let the function caller to know
1939     * about it by returning EINTR.
1940     */
1941    if (nslp->nsl_state == NLM_SL_CANCELLED) {
1942        error = EINTR;
1943        goto out;
1944    }
1945
1946    if (cv_res <= 0) {
1947        /* We was woken up either by timeout or interrupt */
1948        error = (cv_res < 0) ? ETIMEDOUT : EINTR;
1949
1950        /*
1951         * The granted message may arrive after the
1952         * interrupt/timeout but before we manage to lock the
1953         * mutex. Detect this by examining nslp.
1954         */
1955        if (nslp->nsl_state == NLM_SL_GRANTED)
1956            error = 0;
1957    } else { /* awaken via cv_signal or didn't block */
1958        error = 0;
1959        VERIFY(nslp->nsl_state == NLM_SL_GRANTED);
1960    }

The one possible root cause might be that the cv_timedwait_sig() at line 1932 might return >0 even in a case the cv was not signaled/broadcasted (see NOTES in condvar(9f) for details).

I'll know more once I have the crash dump file (it is just downloading).

#2

Updated by Marcel Telka about 7 years ago

  • Status changed from New to In Progress
#3

Updated by Marcel Telka about 7 years ago

  • Subject changed from Kernel panic in new NFS Lock Manager to Assertion failed in nlm_slock_wait()
#4

Updated by Marcel Telka about 7 years ago

I analyzed the source code and there seems to be no other root cause for this issue, only what I described in Note 1.

#5

Updated by Garrett D'Amore about 7 years ago

Yep, clearly the code is broken with respect to the contract of cv_timedwait_sig(). That "if" should be rewritten as a while loop:

while (nslp->nsl_state == NLM_SL_BLOCKED) {
        cv_res = cv_timedwait_sig(&nslp->nsl_cond, &g->lock, timeo_ticks);
        if (cv_res <= 0)
                break;
}
#6

Updated by Marcel Telka about 7 years ago

To be sure, I analyzed all of the cv_wait-like calls in the new NLM code (we do have five of them):

$ egrep -n "cv_.*wait" usr/src/uts/common/klm/*
usr/src/uts/common/klm/nlm_impl.c:357:        (void) cv_timedwait(&g->nlm_gc_sched_cv, &g->lock,
usr/src/uts/common/klm/nlm_impl.c:1297:        rc = cv_timedwait_sig(&hostp->nh_recl_cv,
usr/src/uts/common/klm/nlm_impl.c:1932:        cv_res = cv_timedwait_sig(&nslp->nsl_cond,
usr/src/uts/common/klm/nlm_impl.c:2377:        cv_wait(&g->nlm_gc_finish_cv, &g->lock);
usr/src/uts/common/klm/nlm_rpc_handle.c:209:            rc = cv_wait_sig(&hostp->nh_rpcb_cv, &hostp->nh_lock);
$

and it looks like that nlm_impl.c:1932 is just only such issue we have.

#7

Updated by Marcel Telka about 7 years ago

  • Status changed from In Progress to Pending RTI
#8

Updated by Marcel Telka about 7 years ago

From condvar(9f):

RETURN VALUES

...

     >0                                                                                                                      
              For      cv_wait_sig(),      cv_timedwait()      or
              cv_timedwait_sig() indicates that the condition was
              met and the function returned  due  to  a  call  to
              cv_signal()  or  cv_broadcast(), or due to a prema-
              ture wakeup (see NOTES).

...

NOTES
     It is possible for cv_wait(), cv_wait_sig(), cv_timedwait(),
     and  cv_timedwait_sig()  to return prematurely, that is, not

Illumos             Last change: Dec 15, 2003                   4

Kernel Functions for Drivers                          CONDVAR(9F)

     due to a call to cv_signal() or cv_broadcast(). This  occurs
     most   commonly   in   the   case   of   cv_wait_sig()   and
     cv_timedwait_sig() when the thread is stopped and  restarted
     by  job  control signals or by a debugger, but can happen in
     other cases as well, even for  cv_wait().  Code  that  calls
     these  functions must always recheck the reason for blocking
     and call again if the reason for blocking is still true.
#9

Updated by Garrett D'Amore about 7 years ago

  • Status changed from Pending RTI to Resolved
  • % Done changed from 0 to 100

4198 Assertion failed in nlm_slock_wait()
Reviewed by: Garrett D'Amore <>
Reviewed by: Gordon Ross <>
Reviewed by: Richard Lowe <>
Approved by: Garrett D'Amore <>

Also available in: Atom PDF