Bug #13927

closed

core dump of PROT_NONE segment leads to confusing behavior

Added by Robert Mustacchi 4 months ago. Updated 3 months ago.

Status: Closed
Priority: Normal
Category: kernel
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

While working through #13925, I ran mdb on a core file from a Rust program and got the following unexpected behavior:

$ mdb core.dropshot.abrt 
mdb: core file data for mapping at fffffc7fed830000 not saved: Bad address
mdb: core file data for mapping at fffffc7fee760000 not saved: Bad address
mdb: core file data for mapping at fffffc7feed80000 not saved: Bad address
mdb: core file data for mapping at fffffc7feedb0000 not saved: Bad address
mdb: core file data for mapping at fffffc7feee00000 not saved: Bad address
mdb: core file data for mapping at fffffc7feee10000 not saved: Bad address
mdb: core file data for mapping at fffffc7feee30000 not saved: Bad address
mdb: core file data for mapping at fffffc7feee40000 not saved: Bad address
mdb: core file data for mapping at fffffc7feef10000 not saved: Bad address
mdb: core file data for mapping at fffffc7feef20000 not saved: Bad address
mdb: core file data for mapping at fffffc7feef30000 not saved: Bad address
mdb: core file data for mapping at fffffc7feef50000 not saved: Bad address
mdb: core file data for mapping at fffffc7feef70000 not saved: Bad address
mdb: core file data for mapping at fffffc7feef90000 not saved: Bad address
mdb: core file data for mapping at fffffc7feefa0000 not saved: Bad address
mdb: core file data for mapping at fffffc7feefc0000 not saved: Bad address
mdb: core file data for mapping at fffffc7feefd0000 not saved: Bad address
mdb: core file data for mapping at fffffc7feeff0000 not saved: Bad address
mdb: core file data for mapping at fffffc7fef000000 not saved: Bad address
mdb: core file data for mapping at fffffc7fef020000 not saved: Bad address
mdb: core file data for mapping at fffffc7fef040000 not saved: Bad address
mdb: core file data for mapping at fffffc7fef050000 not saved: Bad address
mdb: core file data for mapping at fffffc7fef090000 not saved: Bad address
mdb: core file data for mapping at fffffc7fef0a0000 not saved: Bad address
mdb: core file data for mapping at fffffc7fef0c0000 not saved: Bad address
mdb: core file data for mapping at fffffc7fef1b0000 not saved: Bad address
mdb: core file data for mapping at fffffc7fef1c0000 not saved: Bad address
mdb: core file data for mapping at fffffc7fef1e0000 not saved: Bad address
mdb: core file data for mapping at fffffc7fef200000 not saved: Bad address
mdb: core file data for mapping at fffffc7fef210000 not saved: Bad address
mdb: core file data for mapping at fffffc7fef220000 not saved: Bad address
mdb: core file data for mapping at fffffc7fef230000 not saved: Bad address
mdb: core file data for mapping at fffffc7fef250000 not saved: Bad address
mdb: core file data for mapping at fffffc7fef2e0000 not saved: Bad address
Loading modules: [ libumem.so.1 libc.so.1 ld.so.1 ]

Notably, this did not happen when I loaded a core file taken via gcore from the same process before the abort. I started by looking at the differences between the dumped regions, taking a pmap of both cores and of the live process before either core was taken. Here is one of the mappings of note:

$ pmap core.dropshot.abrt 
core 'core.dropshot.abrt' of 100621:    ./target/debug/examples/basic
...
FFFFFC7FED830000          4K -----*   [ anon ]
...

The asterisk indicates that we failed to dump the mapping, and notably no protection flags are set. In the live process and in the gcore core, the mapping looked the same except that the asterisk was absent. This is a PROT_NONE region, that is, a guard page.

This is its program header in the core written by gcore:

Program Header[41]:
    p_vaddr:      0xfffffc7fed830000  p_flags:    0
    p_paddr:      0                   p_type:     [ PT_LOAD ]
    p_filesz:     0x1000              p_memsz:    0x1000
    p_offset:     0x17df37c           p_align:    0

And here is the corresponding header in the core written by the kernel:

Program Header[41]:
    p_vaddr:      0xfffffc7fed830000  p_flags:    [ PF_SUNW_FAILURE ]
    p_paddr:      0                   p_type:     [ PT_LOAD ]
    p_filesz:     0                   p_memsz:    0x1000
    p_offset:     0x17df37c           p_align:    0

Ultimately we should probably either write something sparse, or recognize that the mapping is PROT_NONE and skip the read rather than attempt it and have it fail.
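To make the difference between the two headers concrete, here is a minimal sketch in C of how a consumer might classify a PT_LOAD entry from its p_flags, p_filesz, and p_memsz. The seg_info_t type, the seg_state_t enum, and seg_classify() are hypothetical names used only for illustration. The fallback value for PF_SUNW_FAILURE is an assumption about what illumos defines in <sys/elf.h> and should be checked against the real header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Assumption: this matches the PF_SUNW_FAILURE definition in illumos
 * <sys/elf.h>. Verify it against your headers before relying on it.
 */
#ifndef PF_SUNW_FAILURE
#define PF_SUNW_FAILURE 0x00100000
#endif

/* Hypothetical, trimmed-down view of the program header fields above. */
typedef struct seg_info {
        uint32_t si_flags;      /* p_flags */
        uint64_t si_filesz;     /* p_filesz */
        uint64_t si_memsz;      /* p_memsz */
} seg_info_t;

typedef enum seg_state {
        SEG_SAVED,      /* the core holds data for the whole mapping */
        SEG_FAILED,     /* the dumper flagged a failure for this mapping */
        SEG_OMITTED     /* less data than memory, but no failure flagged */
} seg_state_t;

seg_state_t
seg_classify(const seg_info_t *si)
{
        assert(si != NULL);

        if ((si->si_flags & PF_SUNW_FAILURE) != 0)
                return (SEG_FAILED);
        if (si->si_filesz < si->si_memsz)
                return (SEG_OMITTED);
        return (SEG_SAVED);
}
```

With the values shown above, the gcore header (p_flags 0, p_filesz 0x1000, p_memsz 0x1000) would classify as SEG_SAVED, while the kernel header (PF_SUNW_FAILURE set, p_filesz 0) would classify as SEG_FAILED.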

Actions #1

Updated by Robert Mustacchi 3 months ago

I ended up dealing with this by specifically writing a PROT_NONE region as a page of zeros when we encounter one, rather than trying to read from the underlying address space in any form.
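The idea can be sketched in userland C as follows. This is only an analogy under stated assumptions, not the kernel change itself: dump_region() is a hypothetical helper, it writes through stdio rather than through the kernel's core-writing path, and it assumes that any mapping that is not PROT_NONE is readable.

```c
#include <sys/mman.h>
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: write len bytes for one mapping to out.
 * A PROT_NONE mapping is emitted as zeros and addr is never
 * dereferenced. Returns 0 on success and -1 on failure.
 */
int
dump_region(const void *addr, size_t len, int prot, FILE *out)
{
        size_t nwritten;

        assert(out != NULL && len > 0);

        if (prot == PROT_NONE) {
                /* Do not touch the mapping; substitute zeros. */
                void *zeros = calloc(1, len);

                if (zeros == NULL)
                        return (-1);
                nwritten = fwrite(zeros, 1, len, out);
                free(zeros);
        } else {
                /* Sketch only: assumes the mapping is readable. */
                nwritten = fwrite(addr, 1, len, out);
        }

        return (nwritten == len ? 0 : -1);
}
```

The point of the design is that the PROT_NONE branch cannot fault, so the segment always ends up with p_filesz equal to p_memsz and no failure flag is needed.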

Actions #2

Updated by Robert Mustacchi 3 months ago

I tested this in a few different ways. First, I took the example dropshot program, which effectively has a PROT_NONE region, took a gcore of it, and then killed it with SIGABRT. I compared the pmap output of the two cores and verified that we no longer fail to dump the region in the kernel case, and that all other mappings were the same.

I also wrote a small program that first fills a page with 'a' via memset and then changes its protection to PROT_NONE with mprotect.

$ cat test/mmap.c 
#include <sys/mman.h>
#include <stdio.h>
#include <signal.h>
#include <err.h>
#include <string.h>

int
main(void)
{
        sigset_t set;
        void *addr;

        addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
        if (addr == MAP_FAILED) {
                err(1, "failed to mmap 4k");
        }

        memset(addr, 'a', 4096);
        printf("%p\n", addr);
        if (mprotect(addr, 4096, PROT_NONE) != 0) {
                err(1, "failed to mprotect");
        }

        sigemptyset(&set);
        sigsuspend(&set);
        return (0);
}

With this program, I took both a gcore and a core from a SIGABRT, and verified in mdb that the dumped data for the page read back as zeros in both. Similarly, if I changed the data via mdb, the change was not reflected on the page.

Because this changed kernel memory allocation in the core dump path, I also NMI'd the system after finishing the testing for #13926, then ran ::findleaks and found nothing related to what I had changed.

Actions #3

Updated by Electric Monk 3 months ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 5d228828cbfb65f9632a1eedca4291380fca8303

commit  5d228828cbfb65f9632a1eedca4291380fca8303
Author: Robert Mustacchi <rm@fingolfin.org>
Date:   2021-08-03T14:36:50.000Z

    13926 core files can fail to dump leading large sections
    13927 core dump of PROT_NONE segment leads to confusing behavior
    Reviewed by: Jason King <jason.brian.king@gmail.com>
    Reviewed by: Patrick Mooney <pmooney@pfmooney.com>
    Reviewed by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
    Approved by: Dan McDonald <danmcd@joyent.com>
