Bug #3489

panic in smbsrv:smb_mbc_vdecodef

Added by David Scharbach over 7 years ago. Updated almost 5 years ago.

Status: Closed
Priority: Normal
Category: kernel
Start date: 2013-01-19
Due date:
% Done: 20%
Estimated time:
Difficulty: Medium
Tags: needs-triage
Gerrit CR:

Description

My OI system is hanging and requires a hard boot. This happens between 1 and 20 days. Since October, the system has never had more than 20 days of uptime.

System is:
i3 3220
Asus P8Z77-v LK
32GB RAM
80GB (mirror) OS SSD
13 3TB Seagate 7200 RPM SATA drives
LSI SAS2008 controller
Intel SAS expander
Intel NIC (recent install, disabled on board NIC)

I attached my crash.0 file.

Cheers,

Dave


Files

crash.0 (868 KB) crash.0 David Scharbach, 2013-01-19 10:35 PM
crash.0 (868 KB) crash.0 David Scharbach, 2013-02-17 12:45 AM
crash.1.feb.24.13 (868 KB) crash.1.feb.24.13 David Scharbach, 2013-02-24 02:39 PM

History

#1

Updated by Sašo Kiselkov over 7 years ago

This appears to be an internal problem in the smbsrv module (the in-kernel SMB server). The failure is a bad trap (memory fault) at smb_mbc_vdecodef+0xb3. This looks like a case of bad input handling. A quick glance through gdb maps this address to smb_mbuf_marshaling.c:360, though that is probably off, since that line contains only an ASSERT, which should cause a panic, not a bad trap.

Dave, can you provide access to the compressed vmdump.0 file by uploading it somewhere? Please note that this may contain sensitive kernel memory, so I recommend encrypting it ("gpg -c vmdump.0", which writes vmdump.0.gpg) and sending the passphrase only via direct e-mail.

#2

Updated by David Scharbach over 7 years ago

I can send a secure file link to those who need the dump file.

And could one of the admins change the name of this issue to something more useful? My fault that the title is lame and I cannot change it.

Cheers,

#3

Updated by David Scharbach over 7 years ago

File sent to Saso.

Cheers,

#4

Updated by David Scharbach over 7 years ago

Here is some of the information on the mailing list discussion from my server.

$ fmdump -m
SUNW-MSG-ID: SUNOS-8000-KL, TYPE: Defect, VER: 1, SEVERITY: Major
EVENT-TIME: Thu Jan 17 20:08:28 CST 2013
PLATFORM: System-Product-Name, CSN: System-Serial-Number, HOSTNAME: openindiana
SOURCE: software-diagnosis, REV: 0.1
EVENT-ID: 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6
DESC: The system has rebooted after a kernel panic. Refer to http://illumos.org/msg/SUNOS-8000-KL for more information.
AUTO-RESPONSE: The failed system image was dumped to the dump device. If savecore is enabled (see dumpadm(1M)) a copy of the dump will be written to the savecore directory /var/crash/openindiana.
IMPACT: There may be some performance impact while the panic is copied to the savecore directory. Disk space usage by panics can be substantial.
REC-ACTION: If savecore is not enabled then please take steps to preserve the crash image.
Use 'fmdump -Vp -u 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6' to view more panic detail. Please refer to the knowledge article for additional information.

With the extended info:

$ fmdump -Vp -u 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6
TIME UUID SUNW-MSG-ID
Jan 17 2013 20:08:28.919350000 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6 SUNOS-8000-KL

TIME                 CLASS                                 ENA
Jan 17 20:08:28.9139 ireport.os.sunos.panic.dump_available 0x0000000000000000
Jan 17 20:08:07.5900 ireport.os.sunos.panic.dump_pending_on_device 0x0000000000000000

nvlist version: 0
version = 0x0
class = list.suspect
uuid = 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6
code = SUNOS-8000-KL
diag-time = 1358474908 917149
de = fmd:///module/software-diagnosis
fault-list-sz = 0x1
fault-list = (array of embedded nvlists)
(start fault-list0)
nvlist version: 0
version = 0x0
class = defect.sunos.kernel.panic
certainty = 0x64
asru = sw:///:path=/var/crash/openindiana/.809adc23-290c-c3bb-bcde-c3d4c5c1ebe6
resource = sw:///:path=/var/crash/openindiana/.809adc23-290c-c3bb-bcde-c3d4c5c1ebe6
savecore-succcess = 1
dump-dir = /var/crash/openindiana
dump-files = vmdump.0
os-instance-uuid = 809adc23-290c-c3bb-bcde-c3d4c5c1ebe6
panicstr = BAD TRAP: type=e (#pf Page fault) rp=ffffff003c913840 addr=77 occurred in module "smbsrv" due to a NULL pointer dereference
panicstack = unix:die+dd () | unix:trap+17db () | unix:cmntrap+e6 () | smbsrv:smb_mbc_vdecodef+b3 () | smbsrv:smb_mbc_decodef+98 () | smbsrv:smb_dispatch_request+ca () | smbsrv:smb_session_worker+6c () | genunix:taskq_d_thread+b1 () | unix:thread_start+8 () |
crashtime = 1358409705
panic-time = January 17, 2013 02:01:45 AM CST CST
(end fault-list0)

fault-status = 0x1
severity = Major
__ttl = 0x1
__tod = 0x50f8ae9c 0x36cc2af0
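
The epoch fields in this report can be cross-checked against the human-readable times. The following is a minimal sketch, assuming the host clock is on US Central Standard Time (UTC-6), as the "CST" suffix suggests; the helper name `local` is made up for illustration. It also shows how long the machine sat between the panic and the diagnosis after reboot.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the fmdump -Vp output above.
CRASHTIME = 1358409705           # crashtime
DIAG_TIME = 1358474908           # first field of diag-time
TOD = (0x50f8ae9c, 0x36cc2af0)   # __tod: seconds, nanoseconds

# Assumption: the host is on US Central Standard Time (UTC-6).
CST = timezone(timedelta(hours=-6), "CST")

def local(epoch):
    """Render an epoch timestamp in the assumed CST zone."""
    return datetime.fromtimestamp(epoch, tz=CST).strftime("%Y-%m-%d %H:%M:%S %Z")

panic_at = local(CRASHTIME)
diagnosed_at = local(DIAG_TIME)
gap = timedelta(seconds=DIAG_TIME - CRASHTIME)

print("panic:    ", panic_at)
print("diagnosed:", diagnosed_at)
print("gap:      ", gap)
print("__tod:    ", TOD[0], TOD[1])
```

Under that assumption the panic time matches the panic-time field, and the gap of roughly 18 hours is consistent with a machine that stayed down until it was rebooted by hand. The __tod pair decodes to the same second as diag-time.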

On 2013-01-19 23:50, Aurélien Larcher wrote:
I cannot tell what would be the next step to diagnose the problem but:

panicstr = BAD TRAP: type=e (#pf Page fault) rp=ffffff003c913840 addr=77
occurred in module "smbsrv" due to a NULL pointer dereference
panicstack = unix:die+dd () | unix:trap+17db () | unix:cmntrap+e6 () |
smbsrv:smb_mbc_vdecodef+b3 () | smbsrv:smb_mbc_decodef+98 () |
smbsrv:smb_dispatch_request+ca () | smbsrv:smb_session_worker+6c () |
genunix:taskq_d_thread+b1 () | unix:thread_start+8 () |

looks like a good start would be to look if there is any bug filed
concerning Samba...

I believe, this would not be Samba (a userspace project) but Solaris
kernel implementation of CIFS, server in this case.

Causes might be varied, but if there is integration with the Windows
network (MSAD domain), it might be one thing worth researching - if
the user account mapping (mapid, PAM), kerberos login of the server to
domain, naming services and such pieces don't log errors of their own...
Not that these SHOULD cause kernel panics, but who knows what the module
can do if fed invalid inputs? ;) - and these you might be warned about
before the crash...

//Jim

It looks like the in-kernel smb server crashed, notably in
smb_mbc_vdecodef+b3. Now your version might not correspond exactly to
mine, but from what I can glean, it appears you hit an ASSERT there -
something happened that the software maker didn't expect. Can you send
me a gzip'ped version of your /usr/kernel/drv/amd64/smbsrv? That's the
in-kernel SMB server module. I'd like to verify whether offset +0xb3
corresponds to the same spot in your function.

Cheers,
--
Saso

#5

Updated by Sašo Kiselkov over 7 years ago

The bug occurred at smb_mbc_vdecodef+0xb3, which, if I read the assembly correctly, corresponds to smb_mbuf_marshaling.c:175. It is somewhat strange that the memory fault occurred at the "cmp $0x54, %eax" instruction, since that doesn't access memory at all (it's a direct integer compare). I built a debug version of the relevant object file using Studio 12.1 and used GDB to cross-check the line correspondence to smb_mbuf_marshaling.c:175:

(gdb) disassemble /m smb_mbc_vdecodef
Dump of assembler code for function smb_mbc_vdecodef:
..[snip]..
175 switch (c) {
0x00000000000000b4 <+172>: lea -0x25(%r13),%eax
0x00000000000000b8 <+176>: cmp $0x54,%eax
0x00000000000000bb <+179>: ja 0x79f <smb_mbc_vdecodef+1943>
0x00000000000000c1 <+185>: movslq %eax,%rax
0x00000000000000c4 <+188>: jmpq *0x0(,%rax,8)

This corresponds quite closely to what's going on in the crash dump near that address:
smb_mbc_vdecodef+0xae: leal -0x25(%r12),%eax
smb_mbc_vdecodef+0xb3: cmpl $0x54,%eax
smb_mbc_vdecodef+0xb6: ja +0x6cd <smb_mbc_vdecodef+0x789>
smb_mbc_vdecodef+0xbc: movslq %eax,%r8
smb_mbc_vdecodef+0xc7: movl (%r14),%eax

So it appears as if the load at smb_mbc_vdecodef+0xae might be the culprit - "fmt" appears to point to a bad address. A quick glance at ::regs in mdb reveals:
%r12 = 0x0000000000000077
There's something obviously rotten here, as -0x25(%r12) would point to inaccessible memory (address 0x52).

At this point, I'm pretty much out of ideas. I'm quite poor at assembly debugging, and this goes way over my head. Might some of the smarter gurus weigh in, please?

#6

Updated by Sašo Kiselkov over 7 years ago

Just a quick follow-up, the exact ::stack (with line numbers where I know them) to this is:
smb_mbc_vdecodef+0xb3(ffffff092ee273d0, fffffffff8290a80, ffffff003c913a50) [at: smb_mbuf_marshaling.c:175]
smb_mbc_decodef+0x98() [just a va_args wrapper function]
smb_dispatch_request+0xca(ffffff092ee27340) [at: smb_dispatch.c:542]
smb_session_worker+0x6c(ffffff092ee27340)
taskq_d_thread+0xb1(ffffff092dc70c80)

#7

Updated by Sašo Kiselkov over 7 years ago

Scrap my previous tangent about the "leal" causing a memory fault; it just does arithmetic. Since %r12 = 0x77, %eax = 0x52 after the leal. This is then compared to 0x54 to decide whether to enter the switch at all, which means %r12 holds the value read from *fmt. A look above shows that %r13 contains the fmt pointer itself, which right now points to the character at zero-based offset 7 of the format string "Mbbbwbww8c2.wwww" (the 'w' right before the character '8').
I'm too stupid to put the whole picture together yet, but I'm making slow progress in at least re-learning what I forgot about assembly.
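
The arithmetic in this comment can be replayed outside the debugger. This is only a sketch of the register math as described here, not of the smbsrv source; the variable names are made up for illustration.

```python
PERCENT = 0x25      # constant subtracted by the leal (ASCII '%')
RANGE = 0x54        # bound used by the cmpl
r12 = 0x77          # value of %r12 reported by ::regs

# leal -0x25(%r12),%eax only computes a value; it does not load from memory.
eax = (r12 - PERCENT) & 0xFFFFFFFF
# "ja" bypasses the switch when eax > 0x54 (unsigned compare).
in_table = eax <= RANGE

# Range of characters the switch can dispatch on.
lowest = chr(PERCENT)
highest = chr(PERCENT + RANGE)

# Locate the character just before '8' in the format string.
fmt = "Mbbbwbww8c2.wwww"
before_8 = fmt.index("8") - 1

print(hex(eax), in_table, repr(chr(r12)))
print("switch covers", repr(lowest), "through", repr(highest))
print("offset", before_8, "holds", repr(fmt[before_8]))
```

So 0x77 is simply the format character 'w', the compare result falls inside the switch range, and nothing in these two instructions should touch address 0x77.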

#8

Updated by Sašo Kiselkov over 7 years ago

After a day of looking at the issue, I feel I have absolutely no idea what's going on here. From the looks of it, %r12=0x77, which is oddly similar to the fault address which induced the panic (addr=77), but it doesn't make any sense for why "cmpl $0x54,%eax" should induce a memory fault at all, since it doesn't operate on memory (well, not in the direct sense anyway). This panic would entirely make sense if the instruction were using %eax as a memory offset as in "cmpl $0x54, (%eax)", but then the kernel would crash every time it came through here.

David, I need you to reproduce this issue more than once and check whether it always occurs in the same place: just let the machine run, and once it goes down, save a new crash dump and compare the output of the ::panicinfo and ::stack commands with that from your previous dump. If it occurs in the same place, then it's obviously an algorithmic error occurring on corrupt or unexpected input. If it changes, we've still got the possibility of a race condition (unlikely given the nature of the faulting instruction) or, more likely, faulty memory. I presume you don't use ECC RAM (the Core i3, I believe, doesn't support it), but have you tried memtest? (Just to rule out one possibility.)

#9

Updated by Sašo Kiselkov over 7 years ago

  • Status changed from New to Feedback

Setting to feedback, this needs more info to correctly assess what's going on. So far it doesn't make much sense to me.

#10

Updated by David Scharbach over 7 years ago

  • % Done changed from 0 to 20

I do not use ECC. I ran 3 passes on memtest86 and everything passed with flying colours.

I will capture the next crash when it happens. Uptime of 2+ days now. Could be tomorrow or a while.

Thank you again for the help.

Dave

#11

Updated by Albert Lee over 7 years ago

Added gwr as a watcher, as I suspect he might be interested.

#12

Updated by Gordon Ross over 7 years ago

  • Subject changed from Unexpected system hang to panic in smbsrv:smb_mbc_vdecodef

There was a somewhat similar panic in #1761 - does your system have that fix?
Or this could be some other case of a short message or something.

If you can make the crash dump available to me, I'll look when I can.

#13

Updated by Albert Lee over 7 years ago

I'm uploading a copy of the dump to beast.ma. It'll be ~trisk/vmdump.3489 on beast.

Alternatively, http://trisk.deadgerbil.com/ws/vmdump.3489

#14

Updated by David Scharbach over 7 years ago

I am sorry I am such a n00b but I cannot answer if I do have the fixes from #1761. I have installed all of the updates available from the update manager. If #1761 is included in those updates then yes. If not, could you please let me know.

Thank you again for all of your help.

Cheers,

Dave

#15

Updated by Sašo Kiselkov over 7 years ago

David, have you seen this bug crop up again? Just checking in so it doesn't get forgotten.

#16

Updated by David Scharbach over 7 years ago

Fortunately and unfortunately no... I will post as soon as an issue happens.

Cheers,

Dave

#17

Updated by David Scharbach over 7 years ago

So, up 29 days and counting. It is both good and bad. Murphy loves me. The anticipation of a crash is worse than the crash itself.

Cheers,

Dave

#18

Updated by Sašo Kiselkov over 7 years ago

Such is life - just when you need a crash, it doesn't happen.

Try to pour some load onto the system, if you can. Don't be gentle to it, smack it around some.

#19

Updated by David Scharbach over 7 years ago

Here is another crash dump. Strange date on it as I was up for 29 days as of today when it crashed.

Let me know if I can provide more information.

Thank you!!

Dave

#20

Updated by David Scharbach over 7 years ago

I had another crash. I figured my stuff out better and realized I uploaded two of the same crash files. Sorry. Crash.1 is for February 24.

This is the latest fmdump -m info:

SUNW-MSG-ID: SUNOS-8000-KL, TYPE: Defect, VER: 1, SEVERITY: Major
EVENT-TIME: Sun Feb 24 07:52:20 CST 2013
PLATFORM: System-Product-Name, CSN: System-Serial-Number, HOSTNAME: openindiana
SOURCE: software-diagnosis, REV: 0.1
EVENT-ID: 0780cba1-6728-c485-f63a-f2065aa89515
DESC: The system has rebooted after a kernel panic. Refer to http://illumos.org/msg/SUNOS-8000-KL for more information.
AUTO-RESPONSE: The failed system image was dumped to the dump device. If savecore is enabled (see dumpadm(1M)) a copy of the dump will be written to the savecore directory /var/crash/openindiana.
IMPACT: There may be some performance impact while the panic is copied to the savecore directory. Disk space usage by panics can be substantial.
REC-ACTION: If savecore is not enabled then please take steps to preserve the crash image.
Use 'fmdump -Vp -u 0780cba1-6728-c485-f63a-f2065aa89515' to view more panic detail. Please refer to the knowledge article for additional information.

and with extended information:

# fmdump -Vp -u 0780cba1-6728-c485-f63a-f2065aa89515
TIME UUID SUNW-MSG-ID
Feb 24 2013 07:52:20.544628000 0780cba1-6728-c485-f63a-f2065aa89515 SUNOS-8000-KL

TIME                 CLASS                                 ENA
Feb 24 07:52:20.5355 ireport.os.sunos.panic.dump_available 0x0000000000000000
Feb 24 07:51:58.1653 ireport.os.sunos.panic.dump_pending_on_device 0x0000000000000000

nvlist version: 0
version = 0x0
class = list.suspect
uuid = 0780cba1-6728-c485-f63a-f2065aa89515
code = SUNOS-8000-KL
diag-time = 1361713940 539283
de = fmd:///module/software-diagnosis
fault-list-sz = 0x1
fault-list = (array of embedded nvlists)
(start fault-list0)
nvlist version: 0
version = 0x0
class = defect.sunos.kernel.panic
certainty = 0x64
asru = sw:///:path=/var/crash/openindiana/.0780cba1-6728-c485-f63a-f2065aa89515
resource = sw:///:path=/var/crash/openindiana/.0780cba1-6728-c485-f63a-f2065aa89515
savecore-succcess = 1
dump-dir = /var/crash/openindiana
dump-files = vmdump.1
os-instance-uuid = 0780cba1-6728-c485-f63a-f2065aa89515
panicstr = BAD TRAP: type=d (#gp General protection) rp=ffffff003cae2d90 addr=ffffff08f4e7fc80
panicstack = unix:die+100 () | unix:trap+43e () | unix:cmntrap+e6 () | unix:rdmsr+2 () | unix:speedstep_pstate_transition+4f () | unix:xc_serv+1d0 () | unix:av_dispatch_autovect+7c () | unix:dispatch_hilevel+1f () | unix:switch_sp_and_call+13 () | unix:do_interrupt+f2 () | unix:cmnint+ba () | unix:acpi_cpu_cstate+2ae () | unix:cpu_acpi_idle+82 () | unix:cpu_idle_adaptive+19 () | unix:idle+114 () | unix:thread_start+8 () |
crashtime = 1361659132
panic-time = February 23, 2013 04:38:52 PM CST CST
(end fault-list0)

fault-status = 0x1
severity = Major
__ttl = 0x1
__tod = 0x512a1b14 0x20765d20

If you need the vmcore.1 file, please let me know and I will make it available.

Any help is greatly appreciated.

Thank you,

Dave

#21

Updated by Gordon Ross over 7 years ago

I looked at the earlier crash dump (where it panics in smb_mbc_vdecodef):

.CG56.6147: .L77001357:
    leal       -37(%r12),%eax            ;/ line : 175
    cmpl       $84,%eax                ;/ line : 175

This is the compiler doing arithmetic on r12, which is 'w' (0x77).
Nothing unusual here, except the page fault trap. This "leal"
instruction does not reference that address, so something is
very strange here. Otherwise, the data structures look OK.

The above, combined with the fact that other crashes from
this machine are in completely different places, leads me to
suspect something at a lower level going wrong, i.e.
hardware (emulator?) errors, or something...
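
The "completely different places" observation can be checked mechanically from the two panicstack strings posted in this thread. A minimal sketch, assuming that the leading unix:die, unix:trap and unix:cmntrap frames are the trap-handling path and can be skipped; the helper names are hypothetical.

```python
# panicstack values copied from the two fmdump -Vp outputs in this thread.
JAN_17 = ("unix:die+dd () | unix:trap+17db () | unix:cmntrap+e6 () | "
          "smbsrv:smb_mbc_vdecodef+b3 () | smbsrv:smb_mbc_decodef+98 () | "
          "smbsrv:smb_dispatch_request+ca () | smbsrv:smb_session_worker+6c () | "
          "genunix:taskq_d_thread+b1 () | unix:thread_start+8 () |")
FEB_23 = ("unix:die+100 () | unix:trap+43e () | unix:cmntrap+e6 () | "
          "unix:rdmsr+2 () | unix:speedstep_pstate_transition+4f () | "
          "unix:xc_serv+1d0 () | unix:av_dispatch_autovect+7c () | "
          "unix:dispatch_hilevel+1f () | unix:switch_sp_and_call+13 () | "
          "unix:do_interrupt+f2 () | unix:cmnint+ba () | "
          "unix:acpi_cpu_cstate+2ae () | unix:cpu_acpi_idle+82 () | "
          "unix:cpu_idle_adaptive+19 () | unix:idle+114 () | "
          "unix:thread_start+8 () |")

# Assumption: these frames belong to trap handling, not to the faulting code.
TRAP_PATH = ("unix:die+", "unix:trap+", "unix:cmntrap+")

def frames(panicstack):
    """Split a panicstack value into 'module:function+offset' frames."""
    return [f.replace("()", "").strip() for f in panicstack.split("|") if f.strip()]

def faulting_frame(panicstack):
    """Return the first frame that is not part of the trap-handling path."""
    for frame in frames(panicstack):
        if not frame.startswith(TRAP_PATH):
            return frame
    return None

first_crash = faulting_frame(JAN_17)
second_crash = faulting_frame(FEB_23)
same_module = first_crash.split(":")[0] == second_crash.split(":")[0]

print(first_crash)
print(second_crash)
print("same module:", same_module)
```

The first panic faults in smbsrv while servicing a request and the second in unix on the idle path, which is the kind of mismatch that points away from a single software defect.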

#22

Updated by David Scharbach over 7 years ago

Thanks for the help. I am up for 28 days this time. Only change was that I updated to Napp-it V0.9x.

We will see what we will see.

Cheers,

#23

Updated by David Scharbach over 7 years ago

Well, it crashed again. Yippee.

I may need to start using another operating system it seems.

Thanks for the help so far. If you have any other ideas, please let me know. If not, I will likely give FreeNAS another shot. Sad to do that because OI seemed perfect for what I am using it for. Given this is a home system, not really interested in changing out hardware as I have too much money tied up in this box as it is :P

Cheers,

#24

Updated by David Scharbach over 7 years ago

And the awesome adventure continues... OI V28 ZFS is really not V28 due to the Illumos tweaks for async stuff. That means it will take a wee bit more time to try out FreeNAS for stability. Oh how I wish OI was stable for me!

Cheers,

#25

Updated by David Scharbach over 7 years ago

FreeNAS 8.3 is up and running. I miss the abilities of OI but it has not crashed after the first 5 days so fingers crossed.

Thank you all for the help and I will likely re-try OI in the future. Keep up the good work.

Cheers,

Dave

#26

Updated by Gordon Ross almost 5 years ago

  • Status changed from Feedback to Closed

This was almost certainly #1761 (which fixed a problem where the message parsing code might run off the end of the message chain).
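
To illustrate the class of bug described here, the following is a toy format-driven decoder that checks the remaining length before every read. It is a sketch only: the format subset, the `decode` function and the `ShortMessage` exception are invented for this example, and it is neither the smbsrv code nor the #1761 fix.

```python
import struct

class ShortMessage(Exception):
    """Raised when the format asks for more bytes than the message holds."""

def decode(fmt, data):
    """Decode bytes using a tiny, hypothetical format language.

    b = unsigned byte, w = little-endian 16-bit word, c = raw byte,
    . = skip one byte, and a decimal number repeats the next character.
    """
    out, pos, i = [], 0, 0
    sizes = {"b": 1, "w": 2, "c": 1, ".": 1}
    while i < len(fmt):
        # Parse an optional decimal repeat count.
        start = i
        while i < len(fmt) and fmt[i].isdigit():
            i += 1
        count = int(fmt[start:i]) if i > start else 1
        if i == len(fmt):
            raise ValueError("repeat count without a format character")
        c = fmt[i]
        i += 1
        if c not in sizes:
            raise ValueError(f"unknown format character {c!r}")
        need = sizes[c] * count
        # The bounds check: refuse to read past the end of the message.
        if pos + need > len(data):
            raise ShortMessage(f"{c!r} needs {need} bytes, {len(data) - pos} left")
        chunk = data[pos:pos + need]
        pos += need
        if c == "b":
            out.extend(chunk)
        elif c == "w":
            out.extend(struct.unpack(f"<{count}H", chunk))
        elif c == "c":
            out.append(chunk)
        # "." consumes bytes but produces no value.
    return out

full = decode("bw2.w", bytes([1, 0x34, 0x12, 0, 0, 0x78, 0x56]))
try:
    decode("bww", bytes([1, 2, 3]))
    short_rejected = False
except ShortMessage:
    short_rejected = True
print(full, short_rejected)
```

A decoder without that check would keep reading past the end of a truncated message; with it, a short message is rejected as an error instead.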
