Bug #3916

closed

elfcore() can induce panic in bcopy()

Added by Robert Mustacchi almost 8 years ago. Updated almost 8 years ago.

Status: Resolved
Priority: High
Category: kernel
Start date: 2013-07-28
% Done: 100%
Difficulty: Medium

Description

From the Joyent bug report:

We died in bcopy() called from elfcore() (called from do_core()):

0xfffffffffb85a6ca()
do_core+0x246(ffffff19b3687940, b, 1, ffffff19c26f1dd0)
dump_one_core+0xe9(b, 7fffffffffffffff, 1, ffffff19c26f1dd0, ffffff00baa0acf0)
core+0x40b(b, 0)
psig+0x4e1()
trap+0x1408(ffffff00baa0af10, 408260, 16)
0xfffffffffb8001d6()

We died copying our lwp_curinfo:

> *panic_thread::print kthread_t t_lwp->lwp_curinfo
t_lwp->lwp_curinfo = 0xffffff96365fef50

But this seems valid:

> 0xffffff96365fef50::whatis
ffffff96365fef50 is allocated from kmem_alloc_96

The problem is that the size on the bcopy is wrong:

> elfcore+0xc0b::dis
elfcore+0xbd6:                  jmp    +0x17    <elfcore+0xbef>
elfcore+0xbd8:                  movq   %gs:0x18,%rax
elfcore+0xbe1:                  movq   0x190(%rax),%r8
elfcore+0xbe8:                  movq   0x5e0(%r8),%rdi
elfcore+0xbef:                  testq  %rdi,%rdi
elfcore+0xbf2:                  je     +0x19    <elfcore+0xc0d>
elfcore+0xbf4:                  addq   $0x8,%rdi
elfcore+0xbf8:                  leaq   0xfffffffffffffe18(%rbp),%rsi
elfcore+0xbff:                  movq   $0x100,%rdx
elfcore+0xc06:                  call   +0x3beb4dd       <bcopy>
elfcore+0xc0b:                  jmp    +0x1f    <elfcore+0xc2c>
elfcore+0xc0d:                  movq   0xfffffffffffffd08(%rbp),%r8
elfcore+0xc14:                  movzbl 0x448(%r8),%eax
elfcore+0xc1c:                  movl   %eax,0xfffffffffffffe18(%rbp)
elfcore+0xc22:                  movl   $0x7fff,0xfffffffffffffe1c(%rbp)
elfcore+0xc2c:                  movq   %gs:0x18,%rax
elfcore+0xc35:                  movq   0x190(%rax),%r8
elfcore+0xc3c:                  cmpl   $0x100000,0x4e8(%r8)
elfcore+0xc47:                  jne    +0x2d    <elfcore+0xc76>
elfcore+0xc49:                  leaq   0xfffffffffffffe18(%rbp),%rdi
elfcore+0xc50:                  leaq   -0xe0(%rbp),%rsi

Note the 0x100 (256 bytes) passed as the size, even though the source buffer came from the kmem_alloc_96 cache, whose buffers are only 96 bytes. The problem is this code:

if (sq != NULL) {
        bcopy(&sq->sq_info, &killinfo.prk_info,
            sizeof (killinfo.prk_info));
} else {
        killinfo.prk_info.si_signo = lwp->lwp_cursig;
        killinfo.prk_info.si_code = SI_NOINFO;
}

sq_info is a k_siginfo_t, not a siginfo_t; the only difference between the two is the extra padding in the latter. Because the copy is sized by the destination (a siginfo_t), bcopy() reads past the end of the source. Normally we get lucky and just copy some garbage into prk_info's padding, but occasionally we are unlucky enough to have a k_siginfo_t that sits immediately below an unmapped page, as in this case.
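
One way to avoid this kind of over-read is to size the copy by the source object and zero the destination's padding first. The following is a minimal user-space sketch of that idea, not the actual change committed for this bug. The types mock_k_siginfo_t and mock_siginfo_t and the function copy_siginfo_bounded() are hypothetical stand-ins invented for illustration (the real k_siginfo_t and siginfo_t have different members), and memcpy() is used in place of the kernel's bcopy().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the compact kernel structure (k_siginfo_t).
 * The member list is invented for this sketch.
 */
typedef struct mock_k_siginfo {
	int si_signo;
	int si_code;
	int si_errno;
	long si_data[4];
} mock_k_siginfo_t;

/*
 * Hypothetical stand-in for the padded structure (siginfo_t): the same
 * leading members, followed by padding that brings the total to 256 bytes.
 */
typedef struct mock_siginfo {
	int si_signo;
	int si_code;
	int si_errno;
	long si_data[4];
	char si_pad[256 - sizeof (mock_k_siginfo_t)];
} mock_siginfo_t;

/*
 * Copy the compact structure into the padded one. The length is taken
 * from the source, so nothing beyond the source object is ever read,
 * and the destination is zeroed first so its padding holds no garbage.
 */
void
copy_siginfo_bounded(const mock_k_siginfo_t *src, mock_siginfo_t *dst)
{
	assert(sizeof (*src) <= sizeof (*dst));
	memset(dst, 0, sizeof (*dst));
	memcpy(dst, src, sizeof (*src));
}
```

Sizing by the source relies on the two layouts sharing the same leading members, which is what the description above states about k_siginfo_t and siginfo_t.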

#1

Updated by Robert Mustacchi almost 8 years ago

  • Status changed from New to Resolved
  • % Done changed from 90 to 100

Resolved in ea2603507f6302005d1c10815c4f0458ae3ec803.
