Project

General

Profile

Actions

Bug #12084

closed

idm_conn_event_handler mishandles CA_DROP action

Added by Mike Gerdts almost 2 years ago. Updated almost 2 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
comstar - iSCSI/FC/SAS target
Start date:
Due date:
% Done:

100%

Estimated time:
Difficulty:
Medium
Tags:
Gerrit CR:

Description

While testing a fix for #4454, I got:

> ::status
debugging crash dump vmcore.12 (64-bit) from omni-2
operating system: 5.11 omni-physio-0-g59a9ce0e70 (i86pc)
build version: gfx-drm - heads/master-0-gbdc58b1-dirty

image uuid: 6c62e915-bc81-4cd5-cc07-ecb85608e869
panic message: assertion failed: 0, file: ../../common/io/idm/idm_conn_sm.c, line: 424
dump content: kernel pages only
> $C
ffffff001ef78ac0 vpanic()
ffffff001ef78b10 0xfffffffffbe2a425()
ffffff001ef78b50 idm_conn_event_handler+0x139(ffffff076697e4d8)
ffffff001ef78c00 taskq_thread+0x315(ffffff0766989628)
ffffff001ef78c10 thread_start+0xb()

The test involves having a zpool with a mirror vdev and a spare sitting atop 3 iscsi disks. A dd loop writes to a file in that pool while another loop offlines an LU, sleeps 15 seconds, onlines it, then sleeps 45 seconds before repeating the loop. Scripts are in #12076.

Near line 424 we have:

 405         if (event_ctx->iec_pdu_event_type != CT_NONE) {
 406                 switch (action) {
 407                 case CA_TX_PROTOCOL_ERROR:
 408                         idm_pdu_tx_protocol_error(ic, pdu);
 409                         break;
 410                 case CA_RX_PROTOCOL_ERROR:
 411                         idm_pdu_rx_protocol_error(ic, pdu);
 412                         break;
 413                 case CA_FORWARD:
 414                         if (!event_ctx->iec_pdu_forwarded) {
 415                                 if (event_ctx->iec_pdu_event_type ==
 416                                     CT_RX_PDU) {
 417                                         idm_pdu_rx_forward(ic, pdu);
 418                                 } else {
 419                                         idm_pdu_tx_forward(ic, pdu);
 420                                 }
 421                         }
 422                         break;
 423                 default:
 424                         ASSERT(0);
 425                         break;
 426                 }
 427         }

The `if` block and `case` statements are:

idm_conn_event_handler+0x100:   movl   0x18(%rbx),%eax
idm_conn_event_handler+0x103:   testl  %eax,%eax
idm_conn_event_handler+0x105:   je     +0x5c    <idm_conn_event_handler+0x163>
idm_conn_event_handler+0x107:   cmpl   $0x1,%r14d
idm_conn_event_handler+0x10b:   je     +0x2af   <idm_conn_event_handler+0x3c0>
idm_conn_event_handler+0x111:   jb     +0x291   <idm_conn_event_handler+0x3a8>
idm_conn_event_handler+0x117:   cmpl   $0x2,%r14d
idm_conn_event_handler+0x11b:   je     +0x15f   <idm_conn_event_handler+0x280>
idm_conn_event_handler+0x121:   movl   $0x1a8,%edx
idm_conn_event_handler+0x126:   movq   $0xfffffffff82479b0,%rsi
idm_conn_event_handler+0x12d:   movq   $0xfffffffff824692b,%rdi
idm_conn_event_handler+0x134:   call   +0x3be9d97       <assfail>
idm_conn_event_handler+0x139:   movl   0x18(%rbx),%eax

Thus it seems that the value of action is in r14

> <r14=D
                3

That is CA_DROP.

203 typedef enum {
204         CA_TX_PROTOCOL_ERROR,   /* Send "protocol error" to state machine */
205         CA_RX_PROTOCOL_ERROR,   /* Send "protocol error" to state machine */
206         CA_FORWARD,             /* State machine event and forward to client */
207         CA_DROP                 /* Drop PDU */
208 } idm_pdu_event_action_t;

Earlier CA_DROP is handled.

 338                 case CA_DROP:
 339                         /*
 340                          * It never even happened
 341                          */
 342                         IDM_SM_LOG(CE_NOTE, "*** drop PDU %p", (void *) pdu);
 343                         idm_pdu_complete(pdu, IDM_STATUS_FAIL);
 344                         break;

Which is consistent with the fact that the pdu has been freed by idm_pdu_complete().

> <event_ctx::print idm_conn_event_ctx_t iec_info | ::whatis
ffffff0729887030 is freed from idm_rx_pdu_cache:
            ADDR          BUFADDR        TIMESTAMP           THREAD
                            CACHE          LASTLOG         CONTENTS
ffffff072efccea8 ffffff0729887030      22e24f71bc7 ffffff001ef78c20
                 ffffff0726beb008 ffffff06e402e800 ffffff06fdecf768
                 kmem_cache_free_debug+0xfa
                 kmem_cache_free+0x86
                 idm_sorx_cache_pdu_cb+0x2a
                 idm_sorx_addl_pdu_cb+0x68
                 idm_pdu_complete+0x17
                 idm_conn_event_handler+0xc2
                 taskq_thread+0x315
                 thread_start+0xb
Actions #2

Updated by Mike Gerdts almost 2 years ago

  • Category set to kernel

I think that idm is on the initiator side, so setting category to kernel rather than comstar. But maybe comstar is more appropriate because of the general area that this is in. :shrug:

Actions #3

Updated by Mike Gerdts almost 2 years ago

  • Subject changed from use after free in idm_conn_event_handler to idm_conn_event_handler mishandles CA_DROP action
  • Description updated (diff)
  • Assignee set to Mike Gerdts
Actions #4

Updated by Mike Gerdts almost 2 years ago

  • Description updated (diff)
Actions #5

Updated by Mike Gerdts almost 2 years ago

  • Category changed from kernel to comstar - iSCSI/FC/SAS target

The fix:

From the switch statement at the top of the function, it is clear that CA_DROP is expected and in non-debug builds it is ignored in the second switch statement. The fix will make it so that the second switch statement is aware of CA_DROP but takes no action on the pdu as it has already been freed. To make it more clear that the pdu is freed, the pointer to it (event_ctx->iec_info) is set to NULL near where it is freed and ASSERTed to be NULL later.

The test plan:

Since I've hit this a couple times while testing the fix for #4454, my ongoing testing of #4454 includes the following dtrace script to be sure this path is visited:

#! /usr/sbin/dtrace -s

idm_conn_event_handler:entry
{
        self->interest = 1;
}

idm_conn_sm_validate_pdu:return
/self->interest && arg1 == 3/
{
        self->survived = 1;
        printf("\n%Y in problematic code path\n", walltimestamp);
        stack();
}

idm_conn_event_handler:return
/self->survived/
{
        printf("Survived!\n");
        self->survived = 0;
}

idm_conn_event_handler:return
{
        self->interest = 0;
}

With the problematic condition is encountered, this script should give a stack trace that looks similar to that found in the description.

Actions #6

Updated by Jerry Jelinek almost 2 years ago

  • Assignee changed from Mike Gerdts to Jerry Jelinek
Actions #7

Updated by Jerry Jelinek almost 2 years ago

For testing, I followed Mike's procedures from the bug description which he used to hit this bug. I'm copying that here since there is a lot of text above.

The test involves having a zpool with a mirror vdev and a spare sitting atop 3 iscsi disks. A dd loop writes to a file in that pool while another loop offlines an LU, sleeps 15 seconds, onlines it, then sleeps 45 seconds before repeating the loop. Scripts are in #12076.

Actions #8

Updated by Electric Monk almost 2 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

git commit 5fe3b0929d8a195dbfa4196d72a3f9b15b745171

commit  5fe3b0929d8a195dbfa4196d72a3f9b15b745171
Author: Mike Gerdts <mike.gerdts@joyent.com>
Date:   2020-02-04T14:09:56.000Z

    12084 idm_conn_event_handler mishandles CA_DROP action
    Portions contributed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Reviewed by: Andy Fiddaman <andy@omniosce.org>
    Reviewed by: C Fraire <cfraire@me.com>
    Approved by: Dan McDonald <danmcd@joyent.com>

Actions

Also available in: Atom PDF