Bug #12777

libc unwinding confused by indirect pointer encoding

Added by Patrick Mooney 6 months ago. Updated 6 months ago.

Status: Closed
Priority: Normal
Category: lib - userland libraries
Start date:
Due date:
% Done: 100%
Estimated time:
Difficulty: Medium
Tags:
Gerrit CR:

Description

Joshua Clulow reported that a rust binary (the accounts test suite for olm-rs) was segfaulting in a manner that suggested bad EH unwind behavior. His initial analysis:

I'm trying to run the tests using my latest illumos Rust toolchain. It's exploding in some kind of exception handling:

    $ pstack core | c++filt
    core 'core' of 58800:   /ws/safari/olm-rs/target/debug/deps/account-1a3261486c079c2e --test-th
     00000000002ca2a1 ???????? ()
     fffffc7fef2ab871 _SUNW_Unwind_RaiseException (7a0e50) + 4d
     0000000000695b79 rust_panic () + 19
     0000000000695a4a std::panicking::rust_panic_with_hook::h076d18f95c7aa9e8
                        () + 18a
     0000000000695606 rust_begin_unwind () + 86
     00000000006ca9ef core::panicking::panic_fmt::h262fde9dbe39fefd () + 2f
     00000000006caa75 core::result::unwrap_failed::h16073982bd056879 () + 75
     0000000000564528 core::result::Result<T,E>::unwrap::hfe4a7faf72a5ce24
                        () + 48
     000000000056cb70 account::remove_one_time_keys_fails::hc1c9d34d01cb5cdc
                        () + 350
     00000000005651b1 account::remove_one_time_keys_fails::{{closure}}::
                        h84ac3c64080e94d0 () + 11
     0000000000564b81 core::ops::function::FnOnce::call_once::
                        h9a27cde9198e737f () + 11
     000000000057c53b test::run_test_in_process::h24e235adda957ee8 () + 1cb
     000000000057bfca test::run_test::run_test_inner::h65eb30c4a4f6bfb3 () + 3ea
     000000000057b83f test::run_test::h4dd21cb050e00b26 () + 2df
     0000000000578815 test::run_tests::h619388e67af2bd60 () + 1115
     000000000058e03b test::console::run_tests_console::h0e34e10957ce1821
                        () + 50b
     0000000000576b15 test::test_main::h25ba3666f8fafb1d () + 225
     0000000000576edf test::test_main_static::h77ffd56098686ca2 () + 14f
     000000000056cc58 account::main::hc4576bc19140554f () + 18
     0000000000563a7e std::rt::lang_start::{{closure}}::h471983fd61030c32 () + e
     0000000000665a9e std::rt::lang_start_internal::hf8ea8139bdfc15df () + 18e
     0000000000563a61 std::rt::lang_start::h5e55069e5b6070c0 () + 41
     000000000056cc8b main () + 2b
     0000000000562903 _start_crt () + 83
     0000000000562868 _start () + 18

I got a list of call instructions in "_Unwind_RaiseException_Body" roughly via ::dis | grep call:

    > ::cat /tmp/bps
    libc.so.1`_Unwind_RaiseException_Body+0x54
    libc.so.1`_Unwind_RaiseException_Body+0x6a
    libc.so.1`_Unwind_RaiseException_Body+0x76
    libc.so.1`_Unwind_RaiseException_Body+0xa3
    libc.so.1`_Unwind_RaiseException_Body+0xbd
    libc.so.1`_Unwind_RaiseException_Body+0xe0
    libc.so.1`_Unwind_RaiseException_Body+0x10c
    libc.so.1`_Unwind_RaiseException_Body+0x125
    libc.so.1`_Unwind_RaiseException_Body+0x153
    libc.so.1`_Unwind_RaiseException_Body+0x16a
    libc.so.1`_Unwind_RaiseException_Body+0x186
    libc.so.1`_Unwind_RaiseException_Body+0x1b3
    libc.so.1`_Unwind_RaiseException_Body+0x1e9
    libc.so.1`_Unwind_RaiseException_Body+0x21e

Installing a breakpoint that disassembles the call instruction and continues:

    > ::delete all
    > ::cat /tmp/bps | ::bp -c '::echo ""; <rip::dis -n 0; :c'
    > :r --test-threads=1

    running 5 tests
    test identity_keys_valid ... ok
    test one_time_keys_valid ... ok
    test remove_one_time_keys ... ok
    test remove_one_time_keys_fails ...

    _Unwind_RaiseException_Body+0x54:  call  -0x240  <libc.so.1`finish_capture>
    _Unwind_RaiseException_Body+0x6a:  call  -0x3dc  <libc.so.1`copy_ctx>
    _Unwind_RaiseException_Body+0x76:  call  -0x3b1  <libc.so.1`ctx_who>
    _Unwind_RaiseException_Body+0xa3:  call  *%r9
    _Unwind_RaiseException_Body+0xbd:  call  -0x228  <libc.so.1`down_one>
    _Unwind_RaiseException_Body+0x76:  call  -0x3b1  <libc.so.1`ctx_who>
    _Unwind_RaiseException_Body+0xa3:  call  *%r9
    mdb: stop on SIGSEGV
    mdb: failed to read instruction at 0x2ca2a1: no mapping for address
    mdb: target stopped at 0x2ca2a1:

I grabbed %r9 before that last indirect call and it has the expected bogus value. So we got it from ctx_who()!

Looking at the arguments to the _Unwind_RaiseException_Body() call itself:

    rdi  7a0e50             struct _Unwind_Exception *exception_object
    rsi  fffffc7fffdfdad0   struct _Unwind_Context *entry_ctx
    rdx  1                  int phase
        _UA_SEARCH_PHASE    1
        _UA_CLEANUP_PHASE   2
        _UA_HANDLER_FRAME   4
        _UA_FORCE_UNWIND    8

    > <rdi::print 'struct _Unwind_Exception'
    {   
        exception_class = 0x4d4f5a0052555354
        exception_cleanup =
            panic_unwind::real_imp::panic::exception_cleanup::h2cb6fc528d47df22
        private_1 = 0
        private_2 = 0
    }

    > <rsi::print 'struct _Unwind_Context'
    {
        entry_regs = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
        current_regs = [ 0, 0, 0, 0x1, 0, 0, 0xfffffc7fffdfdc30, 0, 0, 0,
                            0, 0, 0x724d00, 0, 0x733358, 0xfffffc7fffdfdd20 ]
        cfa = 0x1
        pc = 0x724d00
        ra = 0xa65
        fde = 0x16ba6fa0
        pfn = 0x60
        func = 0x69d4e0
        lsda = 0xfffffc7fef102e50
        range = 0x60
    }

    > 0x724d00::whatis
    724d00 is in /ws/safari/olm-rs/target/debug/deps/account-1a3261486c079c2e
                    [71b000,737000)

So, the call sequence. (phase & _UA_SEARCH_PHASE) seems to be true at entry, so we're in the first arm of the first if in the function.

    call  -0x240  <libc.so.1`finish_capture>
    call  -0x3dc  <libc.so.1`copy_ctx>
    call  -0x3b1  <libc.so.1`ctx_who>
    call  *%r9
    call  -0x228  <libc.so.1`down_one>
    call  -0x3b1  <libc.so.1`ctx_who>
    call  *%r9

Using his initial analysis as a starting point, I built the same test binary and observed the same behavior: a segfault on a jump "into space". I began tracing where ctx_who() gets the personality function it returns (the target of the fateful jump).

We can see that's populated from pfn:

static _Unwind_Personality_Fn
ctx_who(struct _Unwind_Context *ctx)
{
        return (ctx->pfn);
}

Further tracing led me to _Unw_Decode_FDE(), which is responsible for populating pfn in that context. It's a rather grimy mess of eh_frame/DWARF parsing, but some dtrace makes it clearer:

pid$target::ctx_who:return
{
        printf("%s() = %p\n", probefunc, arg1)
}

pid$target::_Unw_Decode_FDE:entry
{
        self->fde = 1;
        printf("%s(%p, %p)\n", probefunc, arg0, arg1);
}
pid$target::_Unw_Decode_FDE:return
{
        self->fde = 0;
        printf("%s() = %p\n", probefunc, arg1);
}

pid$target::_Unw_get_val:entry,
pid$target::get_encoded_val:entry
/self->fde/
{
        printf("%s(%p, %x)\n", probefunc, arg0, arg2);
}

pid$target::_Unw_get_val:return,
pid$target::get_encoded_val:return
/self->fde/
{
        printf("%s() = %p\n", probefunc, arg1);
}

Running against the faulting test binary:

_Unw_Decode_FDE(fffffc7fed9fecd0, fffffc7fed9fef40)
_Unw_get_val(fffffc7fed9fec58, f)
_Unw_get_val() = 20
_Unw_get_val(fffffc7fed9fec58, f)
_Unw_get_val() = 1464
_Unw_get_val(fffffc7fed9fec60, f)
_Unw_get_val() = 14
_Unw_get_val(fffffc7fed9fec60, f)
_Unw_get_val() = 0
_Unw_get_val(fffffc7fed9fec60, b)
_Unw_get_val() = 1
_Unw_get_val(fffffc7fed9fec60, 8)
_Unw_get_val() = 527a
_Unw_get_val(fffffc7fed9fec60, 2)
_Unw_get_val() = 1
_Unw_get_val(fffffc7fed9fec60, 4)
_Unw_get_val() = fffffffffffffff8
_Unw_get_val(fffffc7fed9fec60, b)
_Unw_get_val() = 10
_Unw_get_val(fffffc7fed9fec60, 2)
_Unw_get_val() = 1
_Unw_get_val(fffffc7fed9fec60, b)
_Unw_get_val() = 1b
_Unw_get_val(fffffc7fed9fec58, 6)
get_encoded_val(fffffc7fed9fec58, 1b)
_Unw_get_val(fffffc7fed9fec58, 14)
_Unw_get_val() = 22c478
get_encoded_val() = 6723a0
_Unw_get_val() = 6723a0
_Unw_get_val(fffffc7fed9fec58, 7)
get_encoded_val(fffffc7fed9fec58, 3)
_Unw_get_val(fffffc7fed9fec58, f)
_Unw_get_val() = 294
get_encoded_val() = 294
_Unw_get_val() = 294
_Unw_get_val(fffffc7fed9fec58, 2)
_Unw_get_val() = 0
_Unw_Decode_FDE() = fffffc7fed9fecd0
ctx_who() = fffffc7fef2a8bb7
_Unw_Decode_FDE(fffffc7fed9fecc0, fffffc7fed9feda0)
_Unw_get_val(fffffc7fed9fec48, f)
_Unw_get_val() = 24
_Unw_get_val(fffffc7fed9fec48, f)
_Unw_get_val() = 3f4
_Unw_get_val(fffffc7fed9fec50, f)
_Unw_get_val() = 1c
_Unw_get_val(fffffc7fed9fec50, f)
_Unw_get_val() = 0
_Unw_get_val(fffffc7fed9fec50, b)
_Unw_get_val() = 1
_Unw_get_val(fffffc7fed9fec50, 8)
_Unw_get_val() = 524c507a
_Unw_get_val(fffffc7fed9fec50, 2)
_Unw_get_val() = 1
_Unw_get_val(fffffc7fed9fec50, 4)
_Unw_get_val() = fffffffffffffff8
_Unw_get_val(fffffc7fed9fec50, b)
_Unw_get_val() = 10
_Unw_get_val(fffffc7fed9fec50, 2)
_Unw_get_val() = 7
_Unw_get_val(fffffc7fed9fec50, b)
_Unw_get_val() = 9b
_Unw_get_val(fffffc7fed9fec50, 6)
get_encoded_val(fffffc7fed9fec50, 9b)
_Unw_get_val(fffffc7fed9fec50, 14)
_Unw_get_val() = 2d5225
get_encoded_val() = 2d5225
_Unw_get_val() = 2d5225
_Unw_get_val(fffffc7fed9fec50, b)
_Unw_get_val() = 1b
_Unw_get_val(fffffc7fed9fec50, b)
_Unw_get_val() = 1b
_Unw_get_val(fffffc7fed9fec48, 6)
get_encoded_val(fffffc7fed9fec48, 1b)
_Unw_get_val(fffffc7fed9fec48, 14)
_Unw_get_val() = 22bb38
get_encoded_val() = 672020
_Unw_get_val() = 672020
_Unw_get_val(fffffc7fed9fec48, 7)
get_encoded_val(fffffc7fed9fec48, 3)
_Unw_get_val(fffffc7fed9fec48, f)
_Unw_get_val() = c3
get_encoded_val() = c3
_Unw_get_val() = c3
_Unw_get_val(fffffc7fed9fec48, 2)
_Unw_get_val() = 4
_Unw_get_val(fffffc7fed9fec48, 6)
get_encoded_val(fffffc7fed9fec48, 1b)
_Unw_get_val(fffffc7fed9fec48, 14)
_Unw_get_val() = 2c2ad3
get_encoded_val() = 708fc4
_Unw_get_val() = 708fc4
_Unw_Decode_FDE() = fffffc7fed9fecc0
ctx_who() = 2d5225

That final ctx_who() return value is the address to which we've jumped. Looking back in the trace, we can see where it came from specifically:

_Unw_get_val(fffffc7fed9fec50, 6)
get_encoded_val(fffffc7fed9fec50, 9b)
_Unw_get_val(fffffc7fed9fec50, 14)
_Unw_get_val() = 2d5225
get_encoded_val() = 2d5225
_Unw_get_val() = 2d5225

The one thing of interest here is the encoding passed to get_encoded_val() (its third argument, printed as arg2 by the D script): 0x9b. We can see how it is interpreted:

get_encoded_val(void **datap, ptrdiff_t reloc, int enc)
{
        int val = enc & 0xf;
        int rel = (enc >> 4) & 0xf;

The lower bits (0xb) are handled fine, resulting in the _Unw_get_val(fffffc7fed9fec50, 14) call to get the offset. The high bits (rel) are a different story. Let's look at the logic to handle those:

        switch (rel) {
        case 0:
                break;
        case 1:
                if (res != 0)
                        res += loc;
                break;
        default:
                /* remainder not implemented */
                break;
        }

With rel equal to 9 (0x9b >> 4), neither case 0 nor case 1 matches, so we land in the default arm and do nothing further with the value. It is returned directly and used as the personality function, when there's clearly more to the story.
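
To make the failure concrete, here is a minimal model of that switch in C. It is a sketch rather than the libc source: enc_val(), enc_rel() and apply_rel_old() are hypothetical names, and res and loc stand in for the decoded value and the location it is relative to.

```c
#include <assert.h>
#include <stdint.h>

/* Split an encoding byte the same way get_encoded_val() does. */
static int
enc_val(int enc)
{
	return (enc & 0xf);
}

static int
enc_rel(int enc)
{
	return ((enc >> 4) & 0xf);
}

/*
 * Hypothetical model of the pre-fix switch: only rel values 0 and 1
 * are acted upon, everything else returns the raw value untouched.
 */
static uint64_t
apply_rel_old(int enc, uint64_t res, uint64_t loc)
{
	assert(enc >= 0 && enc <= 0xff);	/* encodings are one byte */

	switch (enc_rel(enc)) {
	case 0:
		break;
	case 1:
		if (res != 0)
			res += loc;
		break;
	default:
		/* 0x9b lands here: rel == 9, so nothing is applied */
		break;
	}
	return (res);
}
```

For enc = 0x9b, enc_val() yields 0xb and enc_rel() yields 9, so apply_rel_old() hands back its res argument unchanged. That matches the trace, where the raw 2d5225 read by _Unw_get_val() is exactly what get_encoded_val() returns.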

After some searching, I found that the pointer encoding used for that 0x9b value is not well documented. The high bit (0x80) indicates an "indirect" value, even though it's not present in any(!) specification to date. Existing unwinding implementations such as libunwind and gimli simply had to reverse engineer it from gcc or from others who had.

The definitions in gimli are as follows:

// Format of pointer encoding.

// "Unsigned value is encoded using the Little Endian Base 128" 
    DW_EH_PE_uleb128 = 0x1,
// "A 2 bytes unsigned value." 
    DW_EH_PE_udata2 = 0x2,
// "A 4 bytes unsigned value." 
    DW_EH_PE_udata4 = 0x3,
// "An 8 bytes unsigned value." 
    DW_EH_PE_udata8 = 0x4,
// "Signed value is encoded using the Little Endian Base 128" 
    DW_EH_PE_sleb128 = 0x9,
// "A 2 bytes signed value." 
    DW_EH_PE_sdata2 = 0x0a,
// "A 4 bytes signed value." 
    DW_EH_PE_sdata4 = 0x0b,
// "An 8 bytes signed value." 
    DW_EH_PE_sdata8 = 0x0c,

// How the pointer encoding should be applied.

// `DW_EH_PE_pcrel` pointers are relative to their own location.
    DW_EH_PE_pcrel = 0x10,
// "Value is relative to the beginning of the .text section." 
    DW_EH_PE_textrel = 0x20,
// "Value is relative to the beginning of the .got or .eh_frame_hdr section." 
    DW_EH_PE_datarel = 0x30,
// "Value is relative to the beginning of the function." 
    DW_EH_PE_funcrel = 0x40,
// "Value is aligned to an address unit sized boundary." 
    DW_EH_PE_aligned = 0x50,

// This bit can be set for any of the above encoding applications. When set,
// the encoded value is the address of the real pointer result, not the
// pointer result itself.
//
// This isn't defined in the DWARF or the `.eh_frame` standards, but is
// generated by both GNU/Linux and OSX tooling.
    DW_EH_PE_indirect = 0x80,
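
Read against those definitions, an encoding byte has three parts: a format in the low nibble, an application in bits 4 through 6, and the indirect flag in bit 7. The C sketch below splits a byte that way. The DW_EH_PE_* values are copied from the listing above; the mask names, struct pe_fields and pe_split() are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the gimli listing. */
#define	DW_EH_PE_sdata4		0x0b
#define	DW_EH_PE_pcrel		0x10
#define	DW_EH_PE_indirect	0x80

/* Hypothetical mask names, chosen for this sketch only. */
#define	PE_FORMAT_MASK		0x0f
#define	PE_APPLICATION_MASK	0x70

struct pe_fields {
	uint8_t	format;		/* how the value is stored */
	uint8_t	application;	/* what the value is relative to */
	bool	indirect;	/* value is the address of the real pointer */
};

static struct pe_fields
pe_split(int enc)
{
	struct pe_fields f;

	assert(enc >= 0 && enc <= 0xff);	/* encodings are one byte */

	f.format = enc & PE_FORMAT_MASK;
	f.application = enc & PE_APPLICATION_MASK;
	f.indirect = (enc & DW_EH_PE_indirect) != 0;
	return (f);
}
```

Under this split, 0x9b is DW_EH_PE_indirect | DW_EH_PE_pcrel | DW_EH_PE_sdata4, while the 0x1b seen elsewhere in the trace is the same encoding without the indirect flag.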

Clearly we should be extracting the indirect bit so that:
1. The subsequent 0x10 value can be used to encode a relative offset
2. We can dereference the resulting address to get the true value

After updating libc to do so, we can see the unwinder find the proper rust_eh_personality function and complete its iteration.
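
The two steps above can be sketched in C as follows. This is not the code from the actual change; resolve_pointer() and the PE_* macro names are hypothetical, raw stands for the value already decoded from the data stream, and loc for the address it is relative to. Only the pcrel application is modeled, as in the original switch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names for the bits discussed above. */
#define	PE_INDIRECT	0x80
#define	PE_REL_PCREL	1	/* DW_EH_PE_pcrel (0x10) shifted down by 4 */

static uintptr_t
resolve_pointer(int enc, uintptr_t raw, uintptr_t loc)
{
	uintptr_t res = raw;
	int indirect, rel;

	assert(enc >= 0 && enc <= 0xff);	/* encodings are one byte */

	/* Pull the indirect flag out before looking at the application. */
	indirect = (enc & PE_INDIRECT) != 0;
	rel = (enc >> 4) & 0x7;

	/* Step 1: apply the relative offset. */
	if (rel == PE_REL_PCREL && res != 0)
		res += loc;

	/* Step 2: the result is the address of the real pointer. */
	if (indirect && res != 0) {
		uintptr_t target;

		memcpy(&target, (const void *)res, sizeof (target));
		res = target;
	}
	return (res);
}
```

With the flag stripped first, 0x9b takes the same pcrel path as 0x1b and then performs one extra load, instead of falling into the unhandled default arm.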

#1

Updated by Patrick Mooney 6 months ago

  • Subject changed from libc unwind machinery confused by indrect pointer encoding to libc unwinding confused by indirect pointer encoding
#2

Updated by Patrick Mooney 6 months ago

I was hoping that either Java or C++ exceptions would exercise this logic, but that does not appear to be the case. Neither seemed to call into our unwinder at all, in fact.
Finding additional test cases to guard against regression has proven difficult so far.

#3

Updated by Electric Monk 6 months ago

  • Status changed from In Progress to Closed
  • % Done changed from 0 to 100

git commit e1fb6a07e9492184a949d5a3ba446ff53b888a2b

commit  e1fb6a07e9492184a949d5a3ba446ff53b888a2b
Author: Patrick Mooney <pmooney@pfmooney.com>
Date:   2020-06-08T19:08:39.000Z

    12777 libc unwinding confused by indirect pointer encoding
    Reviewed by: Robert Mustacchi <rm@fingolfin.org>
    Reviewed by: Jason King <jason.king@joyent.com>
    Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk>
    Approved by: Dan McDonald <danmcd@joyent.com>
