Project

General

Profile

Bug #7307

Fixing 5780 introduced a regression

Added by Steve Gonczi about 4 years ago. Updated about 4 years ago.

Status:
Closed
Priority:
High
Assignee:
Category:
kernel
Start date:
2016-08-16
Due date:
% Done:

100%

Estimated time:
1.00 h
Difficulty:
Bite-size
Tags:
Gerrit CR:

Description

The following commit :

Author: Simon Klinkert <simon.klinkert@gmail.com>
Date:   Wed Sep 30 15:39:12 2015 +0200

    5780 Truncated coredumps

Introduces a regression. The problem is the potentially unbalenced VN_RELE(vroot)
in elf_notes.c:write_elfnotes(). This regression turns an app crash into a possible kernel crash.

    393            error = elfnote(vp, &offset, NT_FDINFO,
    394                sizeof (fdinfo), &fdinfo, rlimit, credp);
    395            VN_RELE(fvp);
***396            VN_RELE(vroot);  /* we should only do this if we are bailing out */
    397            if (error)
    398                goto done;
    399            continue;
    400        }

The fix would look like:

---            VN_RELE(vroot);  /* we should only do this if we are bailing out */
---                     if (error)
+++                  if (error) {
+++                         VN_RELE(vroot);
                    goto done;
+++                  }        
                continue;


Related issues

Related to illumos gate - Bug #5780: Truncated coredumpsClosed2015-04-02

Actions

History

#1

Updated by S. Klinkert about 4 years ago

  • Related to Bug #5780: Truncated coredumps added
#2

Updated by S. Klinkert about 4 years ago

Reproducing the described bug:

I took the reproducer from the original issue #5780 (see attachment). vn_rele() is decreasing the v_count of the vroot vnode until the vnode gets inactive and explodes a bit later. Enough coredumps produced by the reproducer will crash the system:

$  for i in {1..200}; do echo $i; ./corebug; done; echo "===> No kernel panic" 
1
Aborting now...: No such process
Abort (core dumped)
2
Aborting now...: No such process
Abort (core dumped)
3
Aborting now...: No such process
Abort (core dumped)
4
Aborting now...: No such process
Abort (core dumped)
5

...

95
Aborting now...: No such process
Abort (core dumped)
96
Aborting now...: No such process
Abort (core dumped)
97
Write failed: Broken pipe

Boom.

The stack should look like this:

Aug 18 19:27:45 host genunix: [ID 683410 kern.notice] BAD TRAP: type=d (#gp General protection) rp=ffffff0009ce0100 addr=0
Aug 18 19:27:45 host unix: [ID 100000 kern.notice]
Aug 18 19:27:45 host unix: [ID 753105 kern.notice] #gp General protection
Aug 18 19:27:45 host unix: [ID 243837 kern.notice] pid=4000, pc=0xfffffffffb85f07d, sp=0xffffff0009ce01f8, eflags=0x10206
Aug 18 19:27:45 host unix: [ID 211416 kern.notice] cr0: 8005003b<pg,wp,ne,et,ts,mp,pe> cr4: 6f8<xmme,fxsr,pge,mce,pae,pse,de>
Aug 18 19:27:45 host unix: [ID 624947 kern.notice] cr2: 8061a78
Aug 18 19:27:45 host unix: [ID 625075 kern.notice] cr3: 11c7c1000
Aug 18 19:27:45 host unix: [ID 625715 kern.notice] cr8: c
Aug 18 19:27:45 host unix: [ID 100000 kern.notice]
Aug 18 19:27:45 host unix: [ID 592667 kern.notice]     rdi: ffffff034e318c29 rsi:                0 rdx:               80
Aug 18 19:27:45 host unix: [ID 592667 kern.notice]     rcx:                2  r8:                0  r9: ffffff034d0c7740
Aug 18 19:27:45 host unix: [ID 592667 kern.notice]     rax: ffffff0350ee8b00 rbx: ffffff034e318c29 rbp: ffffff0009ce0260
Aug 18 19:27:45 host unix: [ID 592667 kern.notice]     r10: ffffff0009ce0578 r11: 4000000000000000 r12:                1
Aug 18 19:27:45 host unix: [ID 592667 kern.notice]     r13: 4000000000000000 r14:                0 r15:                0
Aug 18 19:27:45 host unix: [ID 592667 kern.notice]     fsb:                0 gsb: ffffff034ea45000  ds:               4b
Aug 18 19:27:45 host unix: [ID 592667 kern.notice]      es:               4b  fs:                0  gs:              1c3
Aug 18 19:27:45 host unix: [ID 592667 kern.notice]     trp:                d err:                0 rip: fffffffffb85f07d
Aug 18 19:27:45 host unix: [ID 592667 kern.notice]      cs:               30 rfl:            10206 rsp: ffffff0009ce01f8
Aug 18 19:27:45 host unix: [ID 266532 kern.notice]      ss:               38
Aug 18 19:27:45 host unix: [ID 100000 kern.notice]
Aug 18 19:27:45 host genunix: [ID 655072 kern.notice] ffffff0009cdffe0 unix:real_mode_stop_cpu_stage2_end+9e43 ()
Aug 18 19:27:45 host genunix: [ID 655072 kern.notice] ffffff0009ce00f0 unix:trap+a30 ()
Aug 18 19:27:45 host genunix: [ID 655072 kern.notice] ffffff0009ce0100 unix:cmntrap+e6 ()
Aug 18 19:27:45 host genunix: [ID 655072 kern.notice] ffffff0009ce0260 unix:mutex_owner_running+d ()
Aug 18 19:27:45 host genunix: [ID 655072 kern.notice] ffffff0009ce02a0 zfs:zfs_znode_free+37 ()
Aug 18 19:27:45 host genunix: [ID 655072 kern.notice] ffffff0009ce0300 zfs:zfs_inactive+16a ()
Aug 18 19:27:45 host genunix: [ID 655072 kern.notice] ffffff0009ce0360 genunix:fop_inactive+76 ()
Aug 18 19:27:45 host genunix: [ID 655072 kern.notice] ffffff0009ce0390 genunix:vn_rele+82 ()
Aug 18 19:27:45 host genunix: [ID 655072 kern.notice] ffffff0009ce05e0 genunix:lookuppnvp+3ce ()
Aug 18 19:27:45 host genunix: [ID 655072 kern.notice] ffffff0009ce0680 genunix:lookuppnatcred+15e ()
Aug 18 19:27:45 host genunix: [ID 655072 kern.notice] ffffff0009ce0780 genunix:lookupnameatcred+ad ()
Aug 18 19:27:45 host genunix: [ID 655072 kern.notice] ffffff0009ce07d0 genunix:lookupname+38 ()
Aug 18 19:27:45 host genunix: [ID 655072 kern.notice] ffffff0009ce0990 elfexec:elf32exec+ea3 ()
Aug 18 19:27:45 host genunix: [ID 655072 kern.notice] ffffff0009ce0c40 genunix:gexec+347 ()
Aug 18 19:27:45 host genunix: [ID 655072 kern.notice] ffffff0009ce0e90 genunix:exec_common+4d5 ()
Aug 18 19:27:45 host genunix: [ID 655072 kern.notice] ffffff0009ce0ec0 genunix:exece+1b ()
Aug 18 19:27:45 host genunix: [ID 655072 kern.notice] ffffff0009ce0f10 unix:brand_sys_syscall32+17a ()

Testing with the patch from http://cr.illumos.org/~webrev/cherimoya/7307_fix/:

$ for i in {1..200}; do echo $i; ./corebug; done; echo "===> No kernel panic" 
1
Aborting now...: No such process
Abort (core dumped)
2
Aborting now...: No such process
Abort (core dumped)
3
...

200
Aborting now...: No such process
Abort (core dumped)
===> No kernel panic

Coredumps still look like this with pfiles:

$ pfiles core.corebug.111814
core 'core.corebug.111814' of 111814:   ./corebug
  0: S_IFCHR mode:0620 dev:550,0 ino:1880641978 uid:0 gid:7 rdev:242,1
     O_RDWR|O_NOCTTY|O_LARGEFILE
     /dev/pts/1
     offset:46511
  1: S_IFCHR mode:0620 dev:550,0 ino:1880641978 uid:0 gid:7 rdev:242,1
     O_RDWR|O_NOCTTY|O_LARGEFILE
     /dev/pts/1
     offset:46511
  2: S_IFCHR mode:0620 dev:550,0 ino:1880641978 uid:0 gid:7 rdev:242,1
     O_RDWR|O_NOCTTY|O_LARGEFILE
     /dev/pts/1
     offset:46511
  7: 0x0000  mode:0000 dev:0,0 ino:0 uid:-1 gid:-1 rdev:0,0
     O_RDONLY|O_NDELAY|O_LARGEFILE FD_CLOEXEC
     offset:0

#3

Updated by Steve Gonczi about 4 years ago

Here is a kernel core resulting from the bug:
Panic message: assertion failed: vp->v_count_dnlc == 0, file: ../../common/fs/vnode.c, line: 2428
dump content: kernel pages only

$C

ffffff000cfebfc0 vpanic()
ffffff000cfec010 0xfffffffffbe0d3d4()
ffffff000cfec040 vn_free+0x150(ffffff0320998800)
ffffff000cfec070 zfs_znode_cache_destructor+0x4b(ffffff0323715e40, 0)
ffffff000cfec0d0 kmem_cache_free_debug+0x214(ffffff0320a07848, ffffff0323715e40, fffffffff78f695b)
ffffff000cfec110 kmem_cache_free+0x153(ffffff0320a07848, ffffff0323715e40)
ffffff000cfec150 zfs_znode_free+0xab(ffffff0323715e40)
ffffff000cfec190 zfs_zinactive+0xd6(ffffff0323715e40)
ffffff000cfec1f0 zfs_inactive+0x75(ffffff0320998800, ffffff03478bf9f0, 0)
ffffff000cfec250 fop_inactive+0x76(ffffff0320998800, ffffff03478bf9f0, 0)
ffffff000cfec280 vn_rele+0x82(ffffff0320998800)
ffffff000cfec850 write_elfnotes+0x850(ffffff04100e50b0, 6, ffffff0373d09900, 1d0c, fffffffffffffffd, ffffff0320836dd8, 1fff)
ffffff000cfecbc0 elfcore+0xb56(ffffff0373d09900, ffffff04100e50b0, ffffff0320836dd8, fffffffffffffffd, 6, 1fff)
ffffff000cfecc60 do_core+0x194(ffffff040f51a680, 6, 1, ffffff03236cc740)
ffffff000cfecd00 dump_one_core+0x159(6, 7fffffffffffffff, 1, ffffff03236cc740, ffffff000cfecd58)
ffffff000cfecda0 core+0x3a8(6, 0)
ffffff000cfece30 psig+0x5b4()
ffffff000cfecef0 post_syscall+0x8bd(0, ffffff03302873c0)
ffffff000cfecf00 0xfffffffffb8011f4()

An userland application is dumping core. This bug turns the userland core into a kernel core.

ffffff0320998800::print vnode_t { v_flag = 0x1 v_count = 0 v_vfsp = root v_stream = 0 v_type = 2 (VDIR) v_rdev = 0xffffffffffffffff v_vfsmountedhere = 0 v_op = 0xffffff032124e900 ... v_path = 0xffffff0322c4ea00 "/" v_rdcnt = 0 v_wrcnt = 0 ... v_count_dnlc = 0x26 }

The kernel assert complains about this vnode being released (v_count is 0) while it is still referenced in the name cache.
Other crash scenarios are possible, and the stack may or may not have write_elfnotes() in it..
If an innocent bystander is hit, the crash may happen later than the root cause dereference error.

#4

Updated by Dan McDonald about 4 years ago

Given it's a reference problem, the stacks can vary (as shown above), but these are good examples.

FOLKS WHO READ THIS IN THE FUTURE --> do not dismiss a stack like this as "this bug" without checking your own code first.

#5

Updated by Electric Monk about 4 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 69f14d1fbc2690fc894c66653f1b371783f23f46

commit  69f14d1fbc2690fc894c66653f1b371783f23f46
Author: Simon Klinkert <simon.klinkert@gmail.com>
Date:   2016-08-20T00:27:00.000Z

    7307 Fixing 5780 introduced a regression
    Reviewed by: Marcel Telka <marcel@telka.sk>
    Reviewed by: Steve Gonczi <gonczi@comcast.net>
    Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
    Approved by: Richard Lowe <richlowe@richlowe.net>

#6

Updated by Yuri Pankov about 4 years ago

  • Description updated (diff)
  • Assignee set to S. Klinkert
  • Tags deleted (needs-triage)
#7

Updated by Yuri Pankov about 4 years ago

  • Description updated (diff)

Also available in: Atom PDF