7378 exported_lock held during nfs4 compound processing

Review Request #813 - Created Jan. 12, 2018 and updated

Information
Marcel Telka
illumos-gate
master
7378
5b4e200...
Reviewers
general
Traditionally, rfs4_compound() holds a read lock on exported_lock during the
processing of the compound request. The processing of some OPs might take
several seconds. In our case, writes to a zfs filesystem can take very long if
the filesystem is at quota. In this case, zfs waits for the next txg to free
some space. This becomes problematic should someone need a write lock, like
share(). If there is a pending writer, it has to wait for all existing readers
to complete. During this time, all new readers also have to wait, effectively
rendering nfs4 service unresponsive.

The aim is to only acquire the exported_lock during nfs4 processing when
needed.

  
Marcel Telka
Marcel Telka
Review request changed

Change Summary:

Implemented changes I suggested here: https://illumos.topicbox.com/groups/developer/T38028833ac394eda

Commit:

-c79bf83eb8720b601919e608cec3ae28f42d927a
+5b4e2003640a22e547b6b7b6ff30a30231a6055b

Diff:

Revision 2 (+248 -153)

Show changes

Vitaliy Gusev

One thing I am concerned - performance metrics. Each PUTFH and LOOKUP (and some others) will
produce twice mutex hold and mutex release - 4 atomic operations. And NFS4 can have a lot of
operations in a request. Additionally note, rw-lock is very heavy (in comparision to plain mutex).

I suppose after-patch behaviour was not implemented intentionally
and original comment has a piece of description for that:

 "since NFS4 breaks fundamental assumptions in the exi_count design"
  1. I would suggest to re-think about usage exported_lock and move away the current usage design to eliminate mutex/rwlock bottlenecks/races.

    The new approach can be as RCU or something similar then fast-path (server) atomatically abtains pointer to a Shared-Tree, while slow-path (configuration) prepares the new-shared-tree and atomically replaces pointer to Shared-Tree.

    However, we need keep in mind in this new approach share/unshare must be as "barrier", otherwise unshared dataset could be accessible for a while. In this case "wait" flag can be added to share/unshare calls to ensure "barrier" behaviour.

  2. One thing I am concerned - performance metrics. Each PUTFH and LOOKUP (and some others) will
    produce twice mutex hold and mutex release - 4 atomic operations. And NFS4 can have a lot of
    operations in a request.

    vn_hold()/vn_rele() does mutex_enter()/mutex_exit() and nobody is concerned about performance there. Are you really concerned that few added exi_hold()/exi_rele() are worse than exported_lock held during the whole compound? What is your concern based on?

    Additionally note, rw-lock is very heavy (in comparision to plain mutex).

    An successfull attempt to hold rw lock as reader is 11 instructions, while successfull mutex_enter() is 5 instructions. Where do you see a possibility that this difference could cause any problem?

    I suppose after-patch behaviour was not implemented intentionally
    and original comment has a piece of description for that:

    "since NFS4 breaks fundamental assumptions in the exi_count design"

    Sorry, I do not understand what you wanted to say here. Please elaborate. Thanks.

    I would suggest to re-think about ... [snip]

    Maybe later as a new project, but definitely not here and now. Sorry.

  3. vn_hold()/vn_rele() does mutex_enter()/mutex_exit() and nobody is concerned about performance there. Are you really concerned that few >added exi_hold()/exi_rele() are worse than exported_lock held during the whole compound? What is your concern based on?

    I just noted that performance after-this-patch could be worse than before. Of course, vn_hold/vn_rele is not perfect, but it is not reason to introduce the new performance bottleneck.

    Holding vn_hold/vn_rele() is not the same as holding exported_lock because vn_hold/vn_rele operates on different mutex-es from different vnode-s, whereas holding exported_lock is very racy due to operates on the same rwlock.

    Holding the same rwlock/mutex is point for performance degradation due to instructions with LOCK are run on the same cache-line-aligned address. Holding different mutex-es/rwlock-s implies LOCK on different cache-line-aligned addresses.

    BTW, vn_hold/vn_rele can be improved by replacing mutex with atomic refcounter.

    An successfull attempt to hold rw lock as reader is 11 instructions, while successfull mutex_enter() is 5 instructions.
    Where do you see a possibility that this difference could cause any problem?

    It is not related to number of instruction and related to implementation and origin of krwlock. In brief, rw_enter(&m, READER) on one CPU conflicts with rw_enter(&m, READER) on another CPU. Running rw_enter(READER) on many CPUs significant increases conflicts and looping within rw_enter(READER).

    Sorry, I do not understand what you wanted to say here. Please elaborate. Thanks

    I suppose there could be reason why Sun's engineers implemented
    the exported_lock holding in such way and I copy-pasted comment:

     /* ...
     * NFS4
     * rfs4_compound holds exported_lock for duration of compound
     * processing.  This version doesn't manipulate exi_count
     * since NFS4 breaks fundamental assumptions in the exi_count
     * design.
     */
    

    And it would be good to know whether they were right or wrong while were writing this comment.

    Maybe later as a new project, but definitely not here and now. Sorry.

    I insist that changing design of usage exported-tree (and exported_lock) must be before or instead of this patch.

Loading...