Bug #12827

NFSv4 client gets hard-hung in the face of high packet corruption

Added by Dan McDonald 6 months ago.

Status: New
Priority: Normal
Assignee: -
Category: nfs - NFS server and client
Start date:
Due date:
% Done: 0%
Estimated time:
Difficulty: Hard
Tags:
Gerrit CR:

Description

Some caveats:

1.) This was found using SmartOS.
2.) This was found using ipdadm(1M) to increase chance of packet corruption to 10%.
3.) This was found using NFS client in one zone, NFS server in another on the same physical machine.
4.) The bug is more reproducible if you use ipdadm on the client.

Having said all of that, I have two coredumps, saved under the directory below, showing stacks like these:

/zones/ee62f3de-bca3-6478-fb4b-feec232b1e99/root/export/home/danmcd/cores/max-hung-nfsv4-client/

The thread that's hard-hung is:

> 0xfffffe886e6e73e0::findstack -v
stack pointer for thread fffffe886e6e73e0 (dd/1): fffffe00bd0229a0
[ fffffe00bd0229a0 _resume_from_idle+0x12b() ]
  fffffe00bd0229d0 swtch+0x16b()
  fffffe00bd022a00 cv_wait+0x89(fffffe00a3ec2b46, fffffffffbc61e10)
  fffffe00bd022a40 page_io_wait+0x53(fffffe00a3ec2b00)
  fffffe00bd022af0 pvn_vplist_dirty+0x40b(fffffe87176c9e00, 0, fffffffff84fd870, 0, fffffe893c4cb8e8)
  fffffe00bd022ba0 nfs4_putpages+0x342(fffffe87176c9e00, 0, 0, 0, fffffe893c4cb8e8)
  fffffe00bd022c20 nfs4_putpage+0xc1(fffffe87176c9e00, 0, 0, 0, fffffe893c4cb8e8, 0)
  fffffe00bd022c90 nfs4_putpage_commit+0xd8(fffffe87176c9e00, 0, 0, fffffe893c4cb8e8)
  fffffe00bd022d50 nfs4_close+0x2db(fffffe87176c9e00, 2302, 1, 32752, fffffe893c4cb8e8, 0)
  fffffe00bd022dd0 fop_close+0x66(fffffe87176c9e00, 2302, 1, 32752, fffffe893c4cb8e8, 0)
  fffffe00bd022e10 closef+0x97(fffffe877db69178)
  fffffe00bd022e80 closeandsetf+0x34c(4, 0)
  fffffe00bd022ea0 close+0x10(4)
  fffffe00bd022f00 _sys_sysenter_post_swapgs+0x259()
> 

It appears to be waiting on one or more threads (two in one dump, five in the other) with less-hard-hung stacks like this one:

stack pointer for thread fffffe00bf366c20 (nfs4_async_start()): fffffe00bf3661d0
[ fffffe00bf3661d0 _resume_from_idle+0x12b() ]
  fffffe00bf366200 swtch+0x16b()
  fffffe00bf366280 cv_timedwait_hires+0xdc(fffffe8896809050, fffffe8896809058, df8475800, f4240, 0)
  fffffe00bf366310 cv_timedwait_sig_hires+0x1d7(fffffe8896809050, fffffe8896809058, df8475800, f4240, 0)
  fffffe00bf366360 cv_timedwait_sig+0x41(fffffe8896809050, fffffe8896809058, 893932)
  fffffe00bf3664b0 clnt_cots_kcallit+0x9cb(fffffe8896809000, 1, fffffffff850f9e0, fffffe00bf366760, fffffffff850fc70, fffffe00bf366780, 3c)
  fffffe00bf3665f0 nfs4_rfscall+0x269(fffffe8b204f4000, 1, fffffffff850f9e0, fffffe00bf366760, fffffffff850fc70, fffffe00bf366780, fffffe893c4cb8e8, fffffe00bf366744, fffffe00bf366654, fffffe0000000000, fffffe877989a700)
  fffffe00bf3666c0 rfs4call+0xf9(fffffe8b204f4000, fffffe00bf366760, fffffe00bf366780, fffffe893c4cb8e8, fffffe00bf366744, 0, fffffe00bf366754)
  fffffe00bf366930 nfs4write+0x22e(fffffe87176c9e00, fffffe8c49c9d000, 18000, 8000, fffffe893c4cb8e8, fffffe00bf366a24)
  fffffe00bf366a00 nfs4_bio+0x24f(fffffe8957073240, fffffe00bf366a24, fffffe893c4cb8e8, 0)
  fffffe00bf366a90 nfs4_rdwrlbn+0xff(fffffe87176c9e00, fffffe004e66aed0, 18000, 8000, 2500, fffffe893c4cb8e8)
  fffffe00bf366b20 nfs4_sync_putapage+0x9c(fffffe87176c9e00, fffffe004e66aed0, 18000, 8000, 2400, fffffe893c4cb8e8)
  fffffe00bf366be0 nfs4_async_common_start+0x323(fffffe88dfe93680, 0)
  fffffe00bf366c00 nfs4_async_pgops_start()
  fffffe00bf366c10 thread_start+0xb()

To reproduce this bug on SmartOS:

1.) Have the NFS client and NFS server on the same machine, but in two distinct zones.
2.) Use ipdadm(1M) on the client (though it also works with the server) to set packet corruption to 10%.

ipdadm corrupt 10

3.) Run a loop of dd(1M) of /usr/dict/words to a new file on the NFS server. Using bash, e.g.:
for i in {1..1000}; do echo $i; dd if=/usr/dict/words of=/mnt/words.$i bs=8k; done

Somewhere in the run it will hang. I've observed that if ipdadm is run on the server, the loop eventually completes, but only after several minutes. With ipdadm on the client, I can get it to hang indefinitely.
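
Since the hang can show up anywhere in the run, it may help to have the loop report the iteration that stalls instead of sitting silently. The following is only a sketch, not something from the original reproduction: `run_with_deadline` and `repro_loop` are hypothetical helper names, and the 120-second default deadline is an arbitrary guess that should be tuned. The helper deliberately does not try to kill a stalled dd, because the hard-hung thread is blocked in cv_wait() and will not respond to signals.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original report): run a command in the
# background and poll once per second until it exits or the deadline passes.
# Returns the command's exit status, or 124 if it is still running at the
# deadline. The stalled command is left alone rather than killed.
run_with_deadline() {
    local deadline=$1; shift
    "$@" &
    local pid=$!
    local waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$deadline" ]; then
            return 124
        fi
        sleep 1
        waited=$((waited + 1))
    done
    wait "$pid"
}

# Hypothetical wrapper around the dd loop from step 3. Arguments: iteration
# count (default 1000) and per-iteration deadline in seconds (default 120,
# an assumed value). Stops at the first iteration that fails or stalls.
repro_loop() {
    local count=${1:-1000} deadline=${2:-120} i rc
    for ((i = 1; i <= count; i++)); do
        echo "$i"
        rc=0
        run_with_deadline "$deadline" \
            dd if=/usr/dict/words of="/mnt/words.$i" bs=8k || rc=$?
        if [ "$rc" -ne 0 ]; then
            echo "iteration $i: status $rc (124 means still running at deadline)" >&2
            return 1
        fi
    done
}
```

After sourcing the file, running `repro_loop` with no arguments repeats the same 1000 iterations as the one-liner in step 3; `repro_loop 1000 300` would allow each dd five minutes before flagging it.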

During reproduction on a DEBUG kernel, I also hit an unrelated ASSERT failure in UDP, likely due to the combination of the loopback path and ipdadm(1M) tripping the assertion.


Related issues

Related to illumos gate - Bug #8040: NFSv4 client: 3-way deadlock between nfs4_bio(), nfs4_do_delegreturn(), and nfs4_flush_pages() (In Progress, Marcel Telka, 2017-04-06)

#1

Updated by Dan McDonald 6 months ago

  • Related to Bug #8040: NFSv4 client: 3-way deadlock between nfs4_bio(), nfs4_do_delegreturn(), and nfs4_flush_pages() added
