Bug #13926

core files can fail to dump leading large sections

Added by Robert Mustacchi 4 months ago. Updated 3 months ago.

Status: Closed
Priority: Normal
Category: kernel
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

While working on #13925, I noticed something weird when taking a core file (via pkill -ABRT) of a Rust program. In this case I was running the debug build of the dropshot basic example. There were two section headers that notably differed from their gcore brethren:

$ elfdump -c core.dropshot.abrt 

Section Header[1]:  sh_name: .symtab
    sh_addr:      0x400000            sh_flags:   0
    sh_size:      0                   sh_type:    [ SHT_SYMTAB ]
    sh_offset:    0                   sh_entsize: 0x18 (0 entries)
    sh_link:      2                   sh_info:    38512
    sh_addralign: 0x8               

Section Header[2]:  sh_name: .strtab
    sh_addr:      0x400000            sh_flags:   [ SHF_STRINGS ]
    sh_size:      0                   sh_type:    [ SHT_STRTAB ]
    sh_offset:    0                   sh_entsize: 0
    sh_link:      0                   sh_info:    0
    sh_addralign: 0x1               

If you look at the version from gcore, it looks more like the following:

Section Header[1]:  sh_name: .symtab
    sh_addr:      0x400000            sh_flags:   0
    sh_size:      0x206dc0            sh_type:    [ SHT_SYMTAB ]
    sh_offset:    0x1c3f37c           sh_entsize: 0x18 (88552 entries)
    sh_link:      2                   sh_info:    38512
    sh_addralign: 0x8               

Section Header[2]:  sh_name: .strtab
    sh_addr:      0x400000            sh_flags:   [ SHF_STRINGS ]
    sh_size:      0x5c29ed            sh_type:    [ SHT_STRTAB ]
    sh_offset:    0x1e4613c           sh_entsize: 0
    sh_link:      0                   sh_info:    0
    sh_addralign: 0x1               

After staring at the code and doing a bunch of experiments, I hypothesized that this would happen if we failed to do something correctly in copy_scn(). I was able to confirm this hypothesis with the following bit of DTrace:

$ pfexec dtrace \
    -n 'fbt::copy_scn:entry { self->t = 1; }' \
    -n 'fbt::copy_scn:return { self->t = 0; }' \
    -n 'fbt::vn_rdwr:entry /self->t/ { self->res = args[9]; self->len = args[3]; }' \
    -n 'fbt::vn_rdwr:return /self->t/ { @["mismatch", *self->res >= self->len] = count(); @["vn", arg1] = count(); }' \
    -n 'fbt::core_write:return /self->t/ { @["core", arg1] = count(); }'
dtrace: description 'fbt::copy_scn:entry' matched 2 probes
dtrace: description 'fbt::copy_scn:return' matched 2 probes
dtrace: description 'fbt::vn_rdwr:entry' matched 1 probe
dtrace: description 'fbt::vn_rdwr:return' matched 1 probe
dtrace: description 'fbt::core_write:return' matched 1 probe
^C

  mismatch                                                  1                2
  core                                                      0               24
  mismatch                                                  0               48
  vn                                                        0               50

Further, I briefly stopped in kmdb and found a case where the buf and size arguments to copy_scn() were in fact NULL and zero, respectively. That explains why the copy_scn() call failed and the sections came out empty. Of course, the question is why this happened, and in this case the answer is somewhat interesting.

In process_scns() we initialize the data buffer and its size to NULL and 0, respectively. As we encounter section headers that we want to save whose data is larger than the current buffer, but critically no larger than elf_datasz_max, we reallocate the buffer to fit. What happened here is that the first two sections are each longer than the 1 MiB limit that we have, which means we never allocate the buffer at all, and copy_scn() is handed no backing storage.
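
To make the failure mode concrete, here is a minimal userland sketch of that allocation pattern. This is a paraphrase for illustration, not the illumos source: ELF_DATASZ_MAX and the section sizes are stand-ins, with the first two sizes taken from the gcore output above.

#include <stdio.h>
#include <stdlib.h>

/* 1 MiB cap, standing in for elf_datasz_max. */
#define	ELF_DATASZ_MAX	(1024 * 1024)

int
main(void)
{
	/* .symtab and .strtab here are both larger than the cap. */
	size_t sh_sizes[] = { 0x206dc0, 0x5c29ed, 0x8000 };
	size_t size = 0;
	char *buf = NULL;

	for (size_t i = 0; i < sizeof (sh_sizes) / sizeof (sh_sizes[0]); i++) {
		/* The buffer only grows for sections under the cap. */
		if (sh_sizes[i] > size && sh_sizes[i] <= ELF_DATASZ_MAX) {
			free(buf);
			size = sh_sizes[i];
			buf = malloc(size);
		}
		/* Stand-in for the copy_scn() call. */
		printf("section %zu (0x%zx bytes): buf=%p size=0x%zx\n",
		    i, sh_sizes[i], (void *)buf, size);
	}
	free(buf);
	return (0);
}

With the two oversized sections coming first, the loop never allocates for them, and the stand-in for copy_scn() sees buf == NULL and size == 0 in both cases.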

We should do two things here:

1. Fix this so that we correctly dump these sections (one possible shape for the copy is sketched after this list).
2. Add a section header equivalent of the program header flag PF_SUNW_FAILURE, which could be used to speed this up. Perhaps call it SHF_ILLUMOS_FAILURE or equivalent.
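
For item (1), here is a minimal userland sketch of a chunked-copy approach, under the assumption that the fix caps the staging buffer at elf_datasz_max and loops. read_chunk() and write_chunk() are hypothetical stand-ins for the vn_rdwr()/core_write() pair, and the actual integrated change may well look different.

#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the vn_rdwr() read of section data. */
static int
read_chunk(char *buf, size_t off, size_t len)
{
	(void) off;
	memset(buf, 0, len);	/* pretend we read this chunk */
	return (0);
}

/* Hypothetical stand-in for the core_write() of that chunk. */
static int
write_chunk(const char *buf, size_t off, size_t len)
{
	(void) buf;
	(void) off;
	(void) len;
	return (0);		/* pretend the write succeeded */
}

/*
 * Copy a section through a staging buffer capped at datasz_max,
 * looping over chunks instead of requiring a single buffer that
 * covers the whole section.
 */
static int
copy_scn_chunked(size_t sh_size, size_t datasz_max)
{
	size_t bufsz = (sh_size < datasz_max) ? sh_size : datasz_max;
	char *buf;
	size_t off;

	if ((buf = malloc(bufsz)) == NULL)
		return (-1);

	for (off = 0; off < sh_size; off += bufsz) {
		size_t n = sh_size - off;

		if (n > bufsz)
			n = bufsz;
		if (read_chunk(buf, off, n) != 0 ||
		    write_chunk(buf, off, n) != 0) {
			free(buf);
			return (-1);
		}
	}
	free(buf);
	return (0);
}

int
main(void)
{
	/* Copy a 0x5c29ed-byte section through a 1 MiB staging buffer. */
	return (copy_scn_chunked(0x5c29ed, 1024 * 1024) == 0 ? 0 : 1);
}

The key design point is that the staging buffer is bounded by datasz_max regardless of sh_size, so a multi-megabyte .symtab or .strtab no longer requires a single allocation of its full length.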
