8040 NFSv4 client: 3-way deadlock between nfs4_bio(), nfs4_do_delegreturn(), and nfs4_flush_pages()

Review Request #591 - Created June 19, 2017 and updated

Information
Marcel Telka
illumos-gate
master
8040
2ce9d73...
Reviewers
general
This fixes a 3-way deadlock in NFSv4 client.
I tested the fix using the reproducer from the bug report and I confirmed that
the deadlock is no longer reproducible.

Issues

  • 0
  • 0
  • 1
  • 1
Description From Last Updated
Marcel Telka
Review request changed

Change Summary:

Removed an obsolete comment from nfs4close_otw().

Commit:

-8b8c95b2b0f7db56cef8ec323f92cb0fb18be756
+2ce9d73a97b154f174a6640332aa29ca16e6510b

Diff:

Revision 2 (+7 -31)

Show changes

Arne Jansen
Ship It!
Vitaliy Gusev

   
usr/src/uts/common/fs/nfs/nfs4_client.c (Diff revision 2)
 
 

I see that r_serial is set only in this function and at the end always is reset to NULL.

Is it possible that @was_serial is always FALSE here or I missed something ? In other words, if @was_serial is TRUE, it should mean that nfs4_attr_cache() is nested call, but I don't see where it can happen.

  1. Maybe there is a nested call possible via the nfs4_attr_cache() -> nfs4_purge_caches() -> nfs4_flush_pages() -> VOP_PUTPAGE() sequence. In any case, if was_serial is always FALSE it is harmless and not related to my fix.

  2. But you would agree that your new code use this value/variable and we need to know all cases for calling of nfs4_attr_cache() before modify its behaviour.

    Nevertheless, nfs4_putpage/nfs4_putpages don't call r4_do_attrcache() or something similar that can call nfs4_attr_cache().

  3. But you would agree that your new code use this value/variable and we need to know all cases for calling of nfs4_attr_cache() before modify its behaviour.

    I'm not sure I understand your arguing. No, my code is not using was_serial.

    Nevertheless, nfs4_putpage/nfs4_putpages don't call r4_do_attrcache() or something similar that can call nfs4_attr_cache().

    If that's the case then was_serial is probably always FALSE. But I do not understand how is that related to my fix and how it affects my fix. Please elaborate.


    Also, I do not understand what is the issue you are suggesting to fix. Please be explicit what is the problem you see here. What is the scenario that could cause a problem? How do you suggest to fix the problem you see? Thanks.

  4. | I'm not sure I understand your arguing. No, my code is not using was_serial.

    | If that's the case then was_serial is probably always FALSE. But I do not understand how is that related to my fix and how it affects my fix. Please elaborate.

    There are two places:
    1.
    - if (rp->r_serial && !was_serial) {
    - klwp_t *lwp = ttolwp(curthread);
    -
    + if (rp->r_serial != NULL && !was_serial) {

    1. Code Under this condition "if (rp->r_serial != NULL && !was_serial)". Line 477,478.

    All above mean that was_serial has meaning and use that value. Therefore if was_serial has no meaning, does it mean that patch that correcting original code and adding a new code should be improved ? I suppose yes.

    | Also, I do not understand what is the issue you are suggesting to fix. Please be explicit what is the problem you see here. What is the scenario that could cause a problem? How do you suggest to fix the problem you see? Thanks.

    My intention is: new code and fixes should not introduce a new uncertainty or leave old uncertainties. Maybe there is not any issue, but to answer for sure, we need to clear understand what each variable means and what each line does. Without it the new code will be as old one. But we want to make the code better. Don't we?

    About scenario. If was_serial is always FALSE , rp->r_serial != NULL means that simultaneously another "nfs4_attr_cache" is being executed. For this case, when you remove check on "recov", does it mean that PURGE_ATTRCACHE4_LOCKED(rp) will almost always clean cached attributes and this may increase number of GETATTR requests ?

Loading...