Project

General

Profile

Bug #5012

mpt_sas fails to issue writes >1MB

Added by Rich Ercolani over 5 years ago. Updated about 5 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Start date:
2014-07-17
Due date:
% Done:

0%

Estimated time:
Difficulty:
Medium
Tags:
needs-triage

Description

OI 151a9 kernel, can try reproducing on latest il-gate if desired.

Got back, from trying to flash a hard drive using sg_write_buffer, an error (the command works on Linux; I'm going to rebuild sg3_utils latest and see if that changes anything), and the following nice message in syslog.

rootnex: [ID 561485 kern.warning] WARNING: mpt_sas: coding error detected, the driver is using ddi_dma_attr(9S) incorrectly. There is a small risk of data corruption in particular with large I/Os. The driver should be replaced with a corrected version for proper system operation. To disable this warning, add 'set rootnex:rootnex_bind_warn=0' to /etc/system(4).

Related issues

Related to illumos gate - Feature #8235: fwflash for sd needs to handle partial writesClosed2017-05-15

Actions

History

#1

Updated by Rich Ercolani over 5 years ago

fwflash hands back:

sd-GENERIC firmware image verifier: supplied filename /root/foo.LOD exceeds maximum allowable size of 1468006 bytes

Curious to know where that magic number comes from...I don't want to go trawling through the SAS spec, so let's see if we find it in the source first...

#2

Updated by Rich Ercolani over 5 years ago

Brief annotation with what I found:

- the magic number 1468006 is in fwflash's sd modules for no particular reason, I increased that number
- the same command works on Linux on the same hardware (sg_write_buffer, not fwflash)
- even once you change that number, fwflash just gets back an unhelpful "error" from inside the kernel, which I realized would involve diving into something like sd.c or below
- decided that was a hellhole to explore another day

#3

Updated by Keith Wesolowski about 5 years ago

  • Subject changed from mpt_sas logs a nasty error about using ddi_dma_attr wrong to mpt_sas fails to issue writes >1MB

The problem here appears to be that we cannot allocate DMA buffers for writing that are larger than 1MB. It's unclear why this is; our ddi_dma_attr allows 24MB chunks. The error message comes from rootnex.c, where we reach the DDI_DMA_TOOBIG failure (this from an actual attempt to upgrade firmware, with a blob that is markedly smaller than the fwflash magic number):

[root@90-e2-ba-07-0b-60 ~]# dtrace -n 'fbt::rootnex_bind_slowpath:return/pid == $target/ { trace(arg1); stack(); }' -c "fwflash -f /var/tmp/MPGNA280.bin -y -d /dev/rdsk/c0t5000CCA024955121d0s0" 
dtrace: description 'fbt::rootnex_bind_slowpath:return' matched 1 probe
fwflash: Failed to flash firmware file /var/tmp/MPGNA280.bin on device /devices/pci@0,0/pci8086,3c06@2,2/pci15d9,691@0/iport@f/disk@w5000cca024955121,0:a,raw: 1
dtrace: pid 3572 has exited
CPU     ID                    FUNCTION:NAME
  9  36522     rootnex_bind_slowpath:return        4294967293
              rootnex`rootnex_coredma_bindhdl+0x275
              rootnex`rootnex_dma_bindhdl+0x4d
              genunix`ddi_dma_buf_bind_handle+0xa1
              mpt_sas`mptsas_scsi_init_pkt+0x193
              scsi`scsi_init_pkt+0x63
              sd`sd_initpkt_for_uscsi+0x8f
              sd`sd_start_cmds+0xcc
              sd`sd_core_iostart+0x90
              sd`sd_uscsi_strategy+0xfa
              genunix`default_physio+0x2db
              genunix`physio+0x25
              scsi`scsi_uscsi_handle_cmd+0x2a5
              sd`sd_ssc_send+0x136
              sd`sdioctl+0xa96
              genunix`cdev_ioctl+0x39
              specfs`spec_ioctl+0x60
              genunix`fop_ioctl+0x55
              genunix`ioctl+0x9b
              unix`_sys_sysenter_post_swapgs+0x153
#4

Updated by Keith Wesolowski about 5 years ago

Furthermore, we are in the third case here, where sgllen limits us. Here's the code in question:

        /*
         * Figure out if we need to do a partial mapping. If so, figure out
         * if we need to trim the buffers when we munge the sgl.
         */
        if ((dma->dp_copybuf_size < sinfo->si_copybuf_req) ||
            (dmao->dmao_size > dma->dp_maxxfer) ||
            ((unsigned)attr->dma_attr_sgllen < sinfo->si_sgl_size)) {
                dma->dp_partial_required = B_TRUE;
                if (attr->dma_attr_granular != 1) {
                        dma->dp_trim_required = B_TRUE;
                }
        } else {
                dma->dp_partial_required = B_FALSE;
                dma->dp_trim_required = B_FALSE;
        }

And here are the actual arguments to the function (ignoring the first and last, which are uninteresting):

[root@90-e2-ba-07-0b-60 ~]# dtrace -n 'fbt::rootnex_bind_slowpath:entry/pid == $target/ { print(*args[1]); print(*args[2]); print(*args[3]); print(*args[4]); }' -c "fwflash -f /var/tmp/MPGNA280.bin -y -d /dev/rdsk/c0t5000CCA024955121d0s0" 
dtrace: description 'fbt::rootnex_bind_slowpath:entry' matched 1 probe
fwflash: Failed to flash firmware file /var/tmp/MPGNA280.bin on device /devices/pci@0,0/pci8086,3c06@2,2/pci15d9,691@0/iport@f/disk@w5000cca024955121,0:a,raw: 1
dtrace: pid 3614 has exited
CPU     ID                    FUNCTION:NAME
  8  36521      rootnex_bind_slowpath:entry struct ddi_dma_req {
    ddi_dma_lim_t *dmar_limits = 0xffffff
    uint_t dmar_flags = 0x1
    int (*)() dmar_fp = 0
    caddr_t dmar_arg = 0
    ddi_dma_obj_t dmar_object = {
        uint_t dmao_size = 0x118000
        ddi_dma_atyp_t dmao_type = DMA_OTYP_BUFVADDR
        ddi_dma_aobj_t dmao_obj = {
            struct v_address virt_obj = {
                caddr_t v_addr = 0x8192008
                struct as *v_as = 0xffffff21f448b3c0
                void *v_priv = 0xffffff221c920550
            }
            struct pp_address pp_obj = {
                struct page *pp_pp = 0x8192008
                uint_t pp_offset = 0xf448b3c0
            }
            struct phy_address phys_obj = {
                ulong_t p_addr = 0x8192008
                ulong_t p_memtype = 0xffffff21f448b3c0
            }
            struct dvma_address dvma_obj = {
                size_t dv_off = 0x8192008
                size_t dv_nseg = 0xffffff21f448b3c0
                struct dvmaseg *dv_seg = 0xffffff221c920550
            }
        }
    }
}rootnex_dma_t {
    boolean_t dp_partial_required = B_FALSE
    boolean_t dp_trim_required = B_FALSE
    boolean_t dp_granularity_power_2 = B_TRUE
    uint64_t dp_maxxfer = 0xfffe00
    boolean_t dp_dvma_used = B_FALSE
    ddi_dma_obj_t dp_dma = {
        uint_t dmao_size = 0x118000
        ddi_dma_atyp_t dmao_type = DMA_OTYP_BUFVADDR
        ddi_dma_aobj_t dmao_obj = {
            struct v_address virt_obj = {
                caddr_t v_addr = 0x8192008
                struct as *v_as = 0xffffff21f448b3c0
                void *v_priv = 0xffffff221c920550
            }
            struct pp_address pp_obj = {
                struct page *pp_pp = 0x8192008
                uint_t pp_offset = 0xf448b3c0
            }
            struct phy_address phys_obj = {
                ulong_t p_addr = 0x8192008
                ulong_t p_memtype = 0xffffff21f448b3c0
            }
            struct dvma_address dvma_obj = {
                size_t dv_off = 0x8192008
                size_t dv_nseg = 0xffffff21f448b3c0
                struct dvmaseg *dv_seg = 0xffffff221c920550
            }
        }
    }
    ddi_dma_obj_t dp_dvma = {
        uint_t dmao_size = 0xbaddcafe
        ddi_dma_atyp_t dmao_type = -1159869698
        ddi_dma_aobj_t dmao_obj = {
            struct v_address virt_obj = {
                caddr_t v_addr = 0xbaddcafebaddcafe
                struct as *v_as = 0xbaddcafebaddcafe
                void *v_priv = 0xbaddcafebaddcafe
            }
            struct pp_address pp_obj = {
                struct page *pp_pp = 0xbaddcafebaddcafe
                uint_t pp_offset = 0xbaddcafe
            }
            struct phy_address phys_obj = {
                ulong_t p_addr = 0xbaddcafebaddcafe
                ulong_t p_memtype = 0xbaddcafebaddcafe
            }
            struct dvma_address dvma_obj = {
                size_t dv_off = 0xbaddcafebaddcafe
                size_t dv_nseg = 0xbaddcafebaddcafe
                struct dvmaseg *dv_seg = 0xbaddcafebaddcafe
            }
        }
    }
    rootnex_sglinfo_t dp_sglinfo = {
        boolean_t si_cancross = B_TRUE
        uint64_t si_min_addr = 0
        uint64_t si_max_addr = 0xffffffffffffffff
        uint64_t si_max_cookie_size = 0xfffe00
        uint64_t si_segmask = 0xffffffff
        uint_t si_flags = 0x600
        uint_t si_max_pages = 0x119
        boolean_t si_bounce_on_seg = B_FALSE
        size_t si_copybuf_req = 0
        off_t si_buf_offset = 0x8
        struct as *si_asp = 0xffffff21f448b3c0
        uint_t si_sgl_size = 0x119
    }
    size_t dp_copybuf_size = 0
    caddr_t dp_cbaddr = 0
    size_t dp_cbsize = 0xbaddcafebaddcafe
    rootnex_pgmap_t *dp_pgmap = 0xbaddcafebaddcafe
    uint_t dp_current_win = 0xbaddcafe
    rootnex_window_t *dp_window = 0
    boolean_t dp_need_to_free_window = B_FALSE
    uint_t dp_window_size = 0xbaddcafe
    uint_t dp_max_win = 0xbaddcafe
    dev_info_t *dp_dip = 0xffffff21d249e7f8
    kmutex_t dp_mutex = {
        void *[1] _opaque = [ 0 ]
    }
    boolean_t dp_inuse = B_FALSE
    size_t dp_cookie_size = 0x1a58
    ddi_dma_cookie_t *dp_cookies = 0xffffff221c62b000
    boolean_t dp_need_to_free_cookie = B_TRUE
    uint_t dp_current_cookie = 0
    ddi_dma_cookie_t *dp_saved_cookies = 0
    boolean_t dp_need_to_switch_cookies = B_FALSE
    void *dp_iommu_private = 0xbaddcafebaddcafe
    uchar_t *dp_prealloc_buffer = 0xffffff221ac78688
    int dp_sleep_flags = 0x1
}ddi_dma_attr_t {
    uint_t dma_attr_version = 0
    uint64_t dma_attr_addr_lo = 0
    uint64_t dma_attr_addr_hi = 0xffffffffffffffff
    uint64_t dma_attr_count_max = 0xffffff
    uint64_t dma_attr_align = 0x4
    uint_t dma_attr_burstsizes = 0x78
    uint32_t dma_attr_minxfer = 0x1
    uint64_t dma_attr_maxxfer = 0xffffff
    uint64_t dma_attr_seg = 0xffffffff
    int dma_attr_sgllen = 0x100
    uint32_t dma_attr_granular = 0x200
    uint_t dma_attr_flags = 0x600
}ddi_dma_obj_t {
    uint_t dmao_size = 0x118000
    ddi_dma_atyp_t dmao_type = DMA_OTYP_BUFVADDR
    ddi_dma_aobj_t dmao_obj = {
        struct v_address virt_obj = {
            caddr_t v_addr = 0x8192008
            struct as *v_as = 0xffffff21f448b3c0
            void *v_priv = 0xffffff221c920550
        }
        struct pp_address pp_obj = {
            struct page *pp_pp = 0x8192008
            uint_t pp_offset = 0xf448b3c0
        }
        struct phy_address phys_obj = {
            ulong_t p_addr = 0x8192008
            ulong_t p_memtype = 0xffffff21f448b3c0
        }
        struct dvma_address dvma_obj = {
            size_t dv_off = 0x8192008
            size_t dv_nseg = 0xffffff21f448b3c0
            struct dvmaseg *dv_seg = 0xffffff221c920550
        }
    }
}

It appears that sgl_cnt here is the transfer size divided by the (smallest) page size, plus 1. We have a maximum limit of MPTSAS_MAX_DMA_SEGS, which is 256. This is how we arrive at a maximum DMA I/O size of just under 1MB. There is even a facility to allow the partial DMA; the target driver can set the PKT_DMA_PARTIAL flag. sd does so based on some logic at 7941:

        /*
         * Adjust the maximum transfer size. This is to fix
         * the problem of partial DMA support on SPARC. Some
         * HBA driver, like aac, has very small dma_attr_maxxfer
         * size, which requires partial DMA support on SPARC.
         * In the future the SPARC pci nexus driver may solve
         * the problem instead of this fix.
         */
        max_xfer_size = scsi_ifgetcap(SD_ADDRESS(un), "dma-max", 1);
        if ((max_xfer_size > 0) && (max_xfer_size < un->un_max_xfer_size)) {
                /* We need DMA partial even on sparc to ensure sddump() works */
                un->un_max_xfer_size = max_xfer_size;
                if (un->un_partial_dma_supported == 0)
                        un->un_partial_dma_supported = 1;
        }
        if (ddi_prop_get_int(DDI_DEV_T_ANY, SD_DEVINFO(un),
            DDI_PROP_DONTPASS, "buf_break", 0) == 1) {
                if (ddi_xbuf_attr_setup_brk(un->un_xbuf_attr,
                    un->un_max_xfer_size) == 1) {
                        un->un_buf_breakup_supported = 1;
                        SD_INFO(SD_LOG_ATTACH_DETACH, un, "sd_unit_attach: " 
                            "un:0x%p Buf breakup enabled\\n", un);
                }
        }

        /*
         * Set PKT_DMA_PARTIAL flag.
         */
        if (un->un_partial_dma_supported == 1) {
                un->un_pkt_flags = PKT_DMA_PARTIAL;
        } else {
                un->un_pkt_flags = 0;
        }

It appears that we need mpt_sas to enforce the dma-max capability to be no greater than PAGE_SIZE * MPTSAS_MAX_DMA_SEGS. But at 9472, mptsas.c does:

        case SCSI_CAP_DMA_MAX:
                rval = (int)mpt->m_msg_dma_attr.dma_attr_maxxfer;
                break;

Which tells sd we can accept up to 24MB without needing partial DMA. Can sd actually help us here? It can if un_max_xfer_size is 1MB or greater; if it isn't, then we need either the following buf_break functionality or sd itself should be failing the write and we should never be getting to the DMA setup code. Is it? It is not! But wait, there's more: sd can still decide that partial DMA support is needed, and sure enough it has:

> *sd_state::walk softstate | ::print struct sd_lun un_max_xfer_size un_partial_dma_supported
un_max_xfer_size = 0x10000
un_partial_dma_supported = 0x1
un_max_xfer_size = 0x100000
un_partial_dma_supported = 0x1
un_max_xfer_size = 0x100000
un_partial_dma_supported = 0x1
un_max_xfer_size = 0x100000
un_partial_dma_supported = 0x1
un_max_xfer_size = 0x100000
un_partial_dma_supported = 0x1
un_max_xfer_size = 0x100000
un_partial_dma_supported = 0x1
un_max_xfer_size = 0x100000
un_partial_dma_supported = 0x1
un_max_xfer_size = 0x100000
un_partial_dma_supported = 0x1
un_max_xfer_size = 0x100000
un_partial_dma_supported = 0x1
un_max_xfer_size = 0x100000
un_partial_dma_supported = 0x1
un_max_xfer_size = 0x100000
un_partial_dma_supported = 0x1
un_max_xfer_size = 0x100000
un_partial_dma_supported = 0x1
un_max_xfer_size = 0x100000
un_partial_dma_supported = 0x1

So we have a bunch of devices where sd has correctly surmised that at most 1MB DMA buffers can be used and that partial DMA support is needed. Sure enough, the right flag is set in the sd_lun:

> *sd_state::walk softstate | ::print struct sd_lun un_pkt_flags
un_pkt_flags = 0x40000
un_pkt_flags = 0x40000
un_pkt_flags = 0x40000
un_pkt_flags = 0x40000
un_pkt_flags = 0x40000
un_pkt_flags = 0x40000
un_pkt_flags = 0x40000
un_pkt_flags = 0x40000
un_pkt_flags = 0x40000
un_pkt_flags = 0x40000
un_pkt_flags = 0x40000
un_pkt_flags = 0x40000
un_pkt_flags = 0x40000

So why isn't it setting PKT_DMA_PARTIAL? sd_initpkt_for_uscsi tells us the bad news:

        /*
         * Allocate the scsi_pkt for the command.
         * Note: If PKT_DMA_PARTIAL flag is set, scsi_vhci binds a path
         *       during scsi_init_pkt time and will continue to use the
         *       same path as long as the same scsi_pkt is used without
         *       intervening scsi_dma_free(). Since uscsi command does
         *       not call scsi_dmafree() before retry failed command, it
         *       is necessary to make sure PKT_DMA_PARTIAL flag is NOT
         *       set such that scsi_vhci can use other available path for
         *       retry. Besides, ucsci command does not allow DMA breakup,
         *       so there is no need to set PKT_DMA_PARTIAL flag.
         */
        if (uscmd->uscsi_rqlen > SENSE_LENGTH) {
                pktp = scsi_init_pkt(SD_ADDRESS(un), NULL,
                    ((bp->b_bcount != 0) ? bp : NULL), uscmd->uscsi_cdblen,
                    ((int)(uscmd->uscsi_rqlen) + sizeof (struct scsi_arq_status)
                    - sizeof (struct scsi_extended_sense)), 0,
                    (un->un_pkt_flags & ~PKT_DMA_PARTIAL) | PKT_XARQ,
                    sdrunout, (caddr_t)un);
        } else {
                pktp = scsi_init_pkt(SD_ADDRESS(un), NULL,
                    ((bp->b_bcount != 0) ? bp : NULL), uscmd->uscsi_cdblen,
                    sizeof (struct scsi_arq_status), 0,
                    (un->un_pkt_flags & ~PKT_DMA_PARTIAL),
                    sdrunout, (caddr_t)un);
        }

It's not clear why this must be so, other than that perhaps this is simply a longstanding bug in the uscsi path. It's likely that we could work around this by forcing this flag set for large transfers. Another option would be to create a new ioctl for WRITE BUFFER that would not use the apparently broken uscsi path.

#5

Updated by Marcel Telka 3 months ago

  • Related to Feature #8235: fwflash for sd needs to handle partial writes added

Also available in: Atom PDF