Bug #3354: kernel crash in rpcsec_gss after using gsscred
Status: Closed
% Done: 100%
Description
Dear team,
we received a kernel panic; we assume the following messages from /var/adm/messages are relevant:
Nov 13 15:36:13 wertach savecore: [ID 570001 auth.error] reboot after panic: BAD TRAP: type=e (#pf Page fault) rp=ffffff000da967c0 addr=10 occurred in module "rpcmod" due to a NULL pointer dereference
Nov 13 15:36:13 wertach savecore: [ID 442558 auth.error] Panic crashdump pending on dump device but dumpadm -n in effect; run savecore(1M) manually to extract. Image UUID a373c73c-a002-64ad-d3ac-c6b5ab0b625c.
The machine is a VMware guest acting as an NFS server with krb5/ldap authentication. crash.0 is attached; the whole dump can be made available. The crash has appeared several times now.
Thanks for looking into it! It is our home-dir file server and we rely on it heavily; it has been working great so far!
Updated by Andreas Philipp about 10 years ago
echo '::status' | mdb 0
debugging crash dump vmcore.0 (64-bit) from wertach
operating system: 5.11 oi_151a7 (i86pc)
image uuid: a373c73c-a002-64ad-d3ac-c6b5ab0b625c
panic message:
BAD TRAP: type=e (#pf Page fault) rp=ffffff000da967c0 addr=10 occurred in module
"rpcmod" due to a NULL pointer dereference
dump content: kernel pages only
Updated by Rich Lowe about 10 years ago
It'd be good to see the stack trace ($C in mdb) as well, and possibly to make the crash dump available somehow.
Updated by Andreas Philipp about 10 years ago
Like this?
echo '$C' | mdb 0
ffffff000da968c0 rmm_wput+0x1a(ffffff02a1b92138, ffffff029a1269e0)
ffffff000da96930 put+0x1aa(ffffff02a1b92138, ffffff029a1269e0)
ffffff000da969a0 svc_cots_ksend+0x1de(ffffff02addd6400, ffffff000da969b0)
ffffff000da96a20 svc_sendreply+0x59(ffffff02addd6400, fffffffff81ac8c8, ffffff000da96ab0)
ffffff000da96b40 do_gss_accept+0x3fd(ffffff02addd6400, ffffff02f207e0e0, ffffff02a86f2378, ffffff03802c02f8, 1, 1)
ffffff000da96b70 svcrpcsec_gss_taskq_func+0x38(ffffff0342775680)
ffffff000da96c20 taskq_thread+0x285(ffffff0295e97378)
ffffff000da96c30 thread_start+8()
Is it possible to send a link to vdump.0 via personal message? And if yes, how?
Updated by Andreas Philipp about 10 years ago
Here is a link to vdump.0 (1.4Gb, should do it):
Updated by Marcel Telka about 10 years ago
553 rmm_wput(queue_t *q, mblk_t *mp)
554 {
555 	(*((struct temp_slot *)q->q_ptr)->ops->xo_wput)(q, mp);
556 }

> ffffff02a1b92138::whatis
ffffff02a1b92138 is ffffff02a1b92040+f8, freed from queue_cache
Updated by Marcel Telka about 10 years ago
- Status changed from New to Feedback
Please try to reproduce the issue with kmem_flags=0xf. Please read http://illumos.org/books/mdb/kmem-1.html for more details.
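For reference, a minimal sketch of how that is usually done (assuming the standard /etc/system mechanism described in the linked chapter; a reboot is needed for the setting to take effect):

set kmem_flags=0xf

With these flags the kmem allocator fills freed buffers with a 0xdeadbeef pattern and records allocation/free stacks, so a stale dereference like the one above should fault right away, and mdb can then show where the buffer was allocated and freed.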
Updated by Marcel Telka about 10 years ago
- Status changed from Feedback to In Progress
Updated by Jan Veninga almost 10 years ago
We have the same issue. The machine is also running on VMware, an NFS server with Kerberos authentication. If more info is needed, please let me know.
Updated by Matthias Frey over 9 years ago
Hi
We have the same issue. We are serving the home directories via nfs/kerberos.
A dump is available here:
Updated by Piotr Jasiukajtis over 9 years ago
Please share your kernel version and panic stack trace.
Updated by Matthias Frey over 9 years ago
> $C
ffffff001feb7870 rmm_wput+0x1a(ffffff04f2588ba0, ffffff0501070220)
ffffff001feb78e0 put+0x1b5(ffffff04f2588ba0, ffffff0501070220)
ffffff001feb7950 svc_cots_ksend+0x8f(ffffff04ffdf0200, ffffff001feb7960)
ffffff001feb79d0 svc_sendreply+0x57(ffffff04ffdf0200, fffffffff8305e40, ffffff001feb7a70)
ffffff001feb7b30 do_gss_accept+0x349(ffffff04ffdf0200, ffffff05305e2a20, ffffff05060c09c8, ffffff0530751bb8, 1, 1)
ffffff001feb7b60 svcrpcsec_gss_taskq_func+0x2f(ffffff050c910e78)
ffffff001feb7c20 taskq_thread+0x2d0(ffffff04f21f5178)
ffffff001feb7c30 thread_start+8()
uname -a:
SunOS iai-fileserv1 5.11 omnios-b281e50 i86pc i386 i86pc
The server is running OmniOS at the moment, but we have this problem with Nexenta and OpenIndiana as well.
Updated by Piotr Jasiukajtis over 9 years ago
Are you able to reproduce this with the oi_151a8 OpenIndiana version?
Updated by Stefan Skoglund about 9 years ago
I'm seeing this too, and this is also a vmware guest.
----
SunOS san2 5.11 oi_151a8 i86pc i386 i86pc Solaris
----
Updated by Marcel Telka over 6 years ago
- Has duplicate Bug #7342: kernel crash in rpcsec_gss after using gsscred added
Updated by Marcel Telka about 6 years ago
- Subject changed from kernel panic due to rpcmod? to kernel crash in rpcsec_gss after using gsscred
Updated by Paul Henson about 5 years ago
We just hit this in the latest OmniOS CE:
- cat /etc/release
OmniOS v11 r151022ac
ffffff0011d54870 rmm_wput+0x1a(ffffff0477fe5670, ffffff03dd6c7b40)
ffffff0011d548e0 put+0x1b5(ffffff0477fe5670, ffffff03dd6c7b40)
ffffff0011d54950 svc_cots_ksend+0x9e(ffffff047644f000, ffffff0011d54960)
ffffff0011d549d0 svc_sendreply+0x57(ffffff047644f000, fffffffff8579e40, ffffff0011d54a70)
ffffff0011d54b30 do_gss_accept+0x443(ffffff047644f000, ffffff0464b24870, ffffff30f4a17c80, ffffff1a23f3cea0, 1, 1)
ffffff0011d54b60 svcrpcsec_gss_taskq_func+0x2f(ffffff0423cd9830)
ffffff0011d54c20 taskq_thread+0x2d0(ffffff03eddec390)
ffffff0011d54c30 thread_start+8()
Longstanding bug 8-/; Marcel, any chance you'd have time to take a look at it again and finish the fix you mentioned in the duplicate ticket :)? I can provide a core dump if you'd like.
Thanks...
Updated by Marcel Telka over 4 years ago
- Has duplicate Bug #9563: Panic with kerberized NFS and GSS added
Updated by Marcel Telka over 4 years ago
I received this today from Santosh Pradhan:
On Tue, May 29, 2018 at 8:35 PM, Marcel Telka <marcel@telka.sk> wrote:
> Hi Santosh,
>
> On Tue, May 29, 2018 at 08:32:17PM +0530, Santosh Pradhan wrote:
> > Hi Marcel,
> > This is Santosh again. Sorry to disturb you. Just wanted to let you know
> > that this issue I have fixed in our branch. I can share the one liner if
> > you want.
>
> Feel free to propose your fix upstream, but I doubt the proper fix is just
> oneliner. IIRC the problem (and the fix) is more complex.

Could be, but as per my understanding it should be a simple one, and the fix works for all the use cases we tested. The stack trace looks similar; here was my analysis:

[root@meeren:meeren]# mdb 0
Loading modules: [ unix genunix dtrace specfs mac cpu.generic uppc pcplusmp scsi_vhci zfs mpt_sas3 sd ip hook neti sockfs arp usba uhci stmf_sbd stmf md lofs random idm crypto nfs ipc nsmb smbsrv ptm ufs logindmux ]
> $C
ffffff005e1089a0 svc_cots_ksend+0x1e1(ffffff0dcade6800, ffffff005e1089b0)
ffffff005e108a20 svc_sendreply+0x59(ffffff0dcade6800, fffffffff821e8c8, ffffff005e108ab0)
ffffff005e108b40 do_gss_accept+0x3fd(ffffff0dcade6800, ffffff12d1e49f80, ffffff105b172a58, ffffff0e85125ad8, 1, 1)
ffffff005e108b70 svcrpcsec_gss_taskq_func+0x38(ffffff0eb11ea658)
ffffff005e108c20 taskq_thread+0x285(ffffff0da63d1360)
ffffff005e108c30 thread_start+8()

> ffffff0dcade6800::print -ta SVCXPRT xp_master
ffffff0dcade6858 SVCMASTERXPRT *xp_master = 0xffffff0dcaea3448
> 0xffffff0dcaea3448::print -ta SVCMASTERXPRT xp_xpc.xpc_wq
ffffff0dcaea3468 queue_t *xp_xpc.xpc_wq = 0    <==== xprt is gone, which means the wq is already closed in the master xprt
> ffffff0dcade6800::print -ta SVCXPRT xp_xpc.xpc_wq
ffffff0dcade6810 queue_t *xp_xpc.xpc_wq = 0xffffff0da66aab90

But the clone xprt is still pointing to the stale one, causing the corruption. The GSS security is handled in a different thread (taskq), i.e. svcrpcsec_gss_taskq_func(), and the NFS RPC requests are handled in a different thread, i.e. svc_run() [created in svc_queuereq()]. Both can execute independently; they know nothing about each other. If svc_run()->svc_getreq()->sec_svc_msg() finishes first [as the GSS security is handled in its own thread, rpcsec_gss_init() can return AUTH_OK as soon as it has dispatched the task to the taskq], then svc_run() releases its slot (it fetched one mblk_t message from the STREAMS write queue), and if there are no more messages, that queue might get closed, which is what happened in this case.

> 0xffffff0da66aab90::print -ta queue_t q_flag
ffffff0da66aabc8 uint_t q_flag = 0x64002

#define QWCLOSE 0x00020000 /* will not be enabled */

> ffffff0dcade6800::print -ta SVCXPRT xp_master
ffffff0dcade6858 SVCMASTERXPRT *xp_master = 0xffffff0dcaea3448
> 0xffffff0dcaea3448::print -ta SVCMASTERXPRT xp_threads xp_xpc.xpc_wq
ffffff0dcaea34d0 int xp_threads = 0x1
ffffff0dcaea3468 queue_t *xp_xpc.xpc_wq = 0

xp_threads is 1 because svcrpcsec_gss_taskq_func() is linked with this xprt, but the queue is gone. svcrpcsec_gss_taskq_func() still accesses the same queue, which was already closed.

> 0xffffff0da66aab90::whatis
ffffff0da66aab90 is ffffff0da66aaa98+f8, freed from queue_cache

This is just a race between these threads. I believe svc_run() [svc_getreq() invokes sec_svc_msg()] should wait for the svcrpcsec_gss_taskq_func() thread to finish, i.e. rpcsec_gss_init() should wait for the task to finish (ddi_taskq_wait()) before returning AUTH_OK, so that the queue would not be closed.
The patch was:

diff --git a/usr/src/uts/common/rpc/sec_gss/svc_rpcsec_gss.c b/usr/src/uts/common/rpc/sec_gss/svc_rpcsec_gss.c
index 23f98b9..ccbdd16 100644
--- a/usr/src/uts/common/rpc/sec_gss/svc_rpcsec_gss.c
+++ b/usr/src/uts/common/rpc/sec_gss/svc_rpcsec_gss.c
@@ -1000,6 +1000,7 @@ rpcsec_gss_init(
 	}
 
 	mutex_exit(&client_data->clm);
+	ddi_taskq_wait(svcrpcsec_gss_init_taskq);
 	*no_dispatch = TRUE;
 	return (AUTH_OK);

Let me know what you think.

Best R,
Santosh
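To make the window Santosh describes easier to see, here is a minimal user-space sketch of the same pattern (purely illustrative: transport_t, worker(), and the pthread plumbing are stand-ins, not the kernel code). An asynchronous worker is handed a pointer to a resource that its dispatcher may tear down before the worker runs; waiting for the worker before the resource can go away (which is what the ddi_taskq_wait() call above does for the init taskq) closes the window.

/*
 * Minimal user-space sketch of the race described above.  The names
 * (transport_t, worker) are illustrative stand-ins, not kernel APIs.
 * Build with: cc -pthread race.c
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct transport {
	int fd;				/* stands in for the STREAMS write queue */
} transport_t;

/* Plays the role of svcrpcsec_gss_taskq_func(): replies asynchronously. */
static void *
worker(void *arg)
{
	transport_t *t = arg;

	/* If the dispatcher has already freed t, this is a use-after-free. */
	(void) printf("replying on fd %d\n", t->fd);
	return (NULL);
}

int
main(void)
{
	transport_t *t = malloc(sizeof (*t));
	pthread_t tid;

	t->fd = 42;
	(void) pthread_create(&tid, NULL, worker, t);	/* async dispatch */

	/*
	 * Buggy ordering: freeing t here, without waiting for the worker,
	 * reproduces the pattern behind the panic (the worker touches freed
	 * memory).  The fixed ordering below waits first, which is what
	 * ddi_taskq_wait() achieves in the patch above.
	 */
	(void) pthread_join(tid, NULL);
	free(t);
	return (0);
}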
Updated by Marcel Telka over 4 years ago
And here is my reply:
On Tue, May 29, 2018 at 08:57:27PM +0530, Santosh Pradhan wrote:
> diff --git a/usr/src/uts/common/rpc/sec_gss/svc_rpcsec_gss.c
> b/usr/src/uts/common/rpc/sec_gss/svc_rpcsec_gss.c
> index 23f98b9..ccbdd16 100644
> --- a/usr/src/uts/common/rpc/sec_gss/svc_rpcsec_gss.c
> +++ b/usr/src/uts/common/rpc/sec_gss/svc_rpcsec_gss.c
> @@ -1000,6 +1000,7 @@ rpcsec_gss_init(
>  	}
>
>  	mutex_exit(&client_data->clm);
> +	ddi_taskq_wait(svcrpcsec_gss_init_taskq);
>  	*no_dispatch = TRUE;
>  	return (AUTH_OK);
>
> Let me know what you think.

Ah, okay. You simply made svcrpcsec_gss_taskq_func() synchronous with rpcsec_gss_init(). This obviously fixes the problem, but with a possible performance penalty. I'm not sure how serious the performance degradation is (or could be), but this definitely needs some measurement to be sure it is acceptable.

If this fix is acceptable, then you can just call svcrpcsec_gss_taskq_func() directly and do not need the taskq for it. This would make your fix a bit wider, but simplify the code significantly.

OTOH, I was thinking about fixing this without any possible performance impact, but that would require a more complex change.
Updated by santosh pradhan over 4 years ago
Again, rpcsec_gss_init() gets called only during mount and not in the data path (e.g. rpcsec_gss_data()), so it would not impact performance, as per my understanding.
Updated by Marcel Telka over 4 years ago
santosh pradhan wrote:
Again, rpcsec_gss_init() gets called only during mount and not in the data path (e.g. rpcsec_gss_data()), so it would not impact performance, as per my understanding.
Sure, but in some scenarios even a slightly slower mount could have a significant performance impact. In any case, I always prefer correct (even if possibly slower) code over faster but buggy code, so I have no problem with your fix. :-)
Updated by Andy Fiddaman over 4 years ago
We've had this reported in OmniOS LTS again after we integrated KPTI.
KPTI is not at fault but appears to have changed timings enough in places to trigger the problem more often - at least for the customer who reported it.
Updated by Andy Fiddaman over 4 years ago
- Assignee set to Andy Fiddaman
- % Done changed from 0 to 80
- Tags deleted (needs-triage)
Updated by Andy Fiddaman over 4 years ago
- Status changed from In Progress to Pending RTI
- % Done changed from 80 to 100
Updated by Electric Monk over 4 years ago
- Status changed from Pending RTI to Closed
git commit 4a3b05278938491ea95d557939c130d3eb17cfd4
commit 4a3b05278938491ea95d557939c130d3eb17cfd4
Author: Andy Fiddaman <omnios@citrus-it.co.uk>
Date:   2018-08-29T17:20:11.000Z

3354 kernel crash in rpcsec_gss after using gsscred
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Carlos Neira <cneirabustos@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>