Bug #5012
openmpt_sas fails to issue writes >1MB
0%
Description
OI 151a9 kernel, can try reproducing on latest il-gate if desired.
Got back, from trying to flash a hard drive using sg_write_buffer, an error (the command works on Linux; I'm going to rebuild sg3_utils latest and see if that changes anything), and the following nice message in syslog.
rootnex: [ID 561485 kern.warning] WARNING: mpt_sas: coding error detected, the driver is using ddi_dma_attr(9S) incorrectly. There is a small risk of data corruption in particular with large I/Os. The driver should be replaced with a corrected version for proper system operation. To disable this warning, add 'set rootnex:rootnex_bind_warn=0' to /etc/system(4).
Related issues
Updated by Rich Ercolani almost 9 years ago
fwflash hands back:
sd-GENERIC firmware image verifier: supplied filename /root/foo.LOD exceeds maximum allowable size of 1468006 bytes
Curious to know where that magic number comes from...I don't want to go trawling through the SAS spec, so let's see if we find it in the source first...
Updated by Rich Ercolani almost 9 years ago
Brief annotation with what I found:
- the magic number 1468006 is in fwflash's sd modules for no particular reason, I increased that number
- the same command works on Linux on the same hardware (sg_write_buffer, not fwflash)
- even once you change that number, fwflash just gets back an unhelpful "error" from inside the kernel, which I realized would involve diving into something like sd.c or below
- decided that was a hellhole to explore another day
Updated by Keith Wesolowski over 8 years ago
- Subject changed from mpt_sas logs a nasty error about using ddi_dma_attr wrong to mpt_sas fails to issue writes >1MB
The problem here appears to be that we cannot allocate DMA buffers for writing that are larger than 1MB. It's unclear why this is; our ddi_dma_attr allows 24MB chunks. The error message comes from rootnex.c, where we reach the DDI_DMA_TOOBIG failure (this from an actual attempt to upgrade firmware, with a blob that is markedly smaller than the fwflash magic number):
[root@90-e2-ba-07-0b-60 ~]# dtrace -n 'fbt::rootnex_bind_slowpath:return/pid == $target/ { trace(arg1); stack(); }' -c "fwflash -f /var/tmp/MPGNA280.bin -y -d /dev/rdsk/c0t5000CCA024955121d0s0" dtrace: description 'fbt::rootnex_bind_slowpath:return' matched 1 probe fwflash: Failed to flash firmware file /var/tmp/MPGNA280.bin on device /devices/pci@0,0/pci8086,3c06@2,2/pci15d9,691@0/iport@f/disk@w5000cca024955121,0:a,raw: 1 dtrace: pid 3572 has exited CPU ID FUNCTION:NAME 9 36522 rootnex_bind_slowpath:return 4294967293 rootnex`rootnex_coredma_bindhdl+0x275 rootnex`rootnex_dma_bindhdl+0x4d genunix`ddi_dma_buf_bind_handle+0xa1 mpt_sas`mptsas_scsi_init_pkt+0x193 scsi`scsi_init_pkt+0x63 sd`sd_initpkt_for_uscsi+0x8f sd`sd_start_cmds+0xcc sd`sd_core_iostart+0x90 sd`sd_uscsi_strategy+0xfa genunix`default_physio+0x2db genunix`physio+0x25 scsi`scsi_uscsi_handle_cmd+0x2a5 sd`sd_ssc_send+0x136 sd`sdioctl+0xa96 genunix`cdev_ioctl+0x39 specfs`spec_ioctl+0x60 genunix`fop_ioctl+0x55 genunix`ioctl+0x9b unix`_sys_sysenter_post_swapgs+0x153
Updated by Keith Wesolowski over 8 years ago
Furthermore, we are in the third case here, where sgllen limits us. Here's the code in question:
/* * Figure out if we need to do a partial mapping. If so, figure out * if we need to trim the buffers when we munge the sgl. */ if ((dma->dp_copybuf_size < sinfo->si_copybuf_req) || (dmao->dmao_size > dma->dp_maxxfer) || ((unsigned)attr->dma_attr_sgllen < sinfo->si_sgl_size)) { dma->dp_partial_required = B_TRUE; if (attr->dma_attr_granular != 1) { dma->dp_trim_required = B_TRUE; } } else { dma->dp_partial_required = B_FALSE; dma->dp_trim_required = B_FALSE; }
And here are the actual arguments to the function (ignoring the first and last, which are uninteresting):
[root@90-e2-ba-07-0b-60 ~]# dtrace -n 'fbt::rootnex_bind_slowpath:entry/pid == $target/ { print(*args[1]); print(*args[2]); print(*args[3]); print(*args[4]); }' -c "fwflash -f /var/tmp/MPGNA280.bin -y -d /dev/rdsk/c0t5000CCA024955121d0s0" dtrace: description 'fbt::rootnex_bind_slowpath:entry' matched 1 probe fwflash: Failed to flash firmware file /var/tmp/MPGNA280.bin on device /devices/pci@0,0/pci8086,3c06@2,2/pci15d9,691@0/iport@f/disk@w5000cca024955121,0:a,raw: 1 dtrace: pid 3614 has exited CPU ID FUNCTION:NAME 8 36521 rootnex_bind_slowpath:entry struct ddi_dma_req { ddi_dma_lim_t *dmar_limits = 0xffffff uint_t dmar_flags = 0x1 int (*)() dmar_fp = 0 caddr_t dmar_arg = 0 ddi_dma_obj_t dmar_object = { uint_t dmao_size = 0x118000 ddi_dma_atyp_t dmao_type = DMA_OTYP_BUFVADDR ddi_dma_aobj_t dmao_obj = { struct v_address virt_obj = { caddr_t v_addr = 0x8192008 struct as *v_as = 0xffffff21f448b3c0 void *v_priv = 0xffffff221c920550 } struct pp_address pp_obj = { struct page *pp_pp = 0x8192008 uint_t pp_offset = 0xf448b3c0 } struct phy_address phys_obj = { ulong_t p_addr = 0x8192008 ulong_t p_memtype = 0xffffff21f448b3c0 } struct dvma_address dvma_obj = { size_t dv_off = 0x8192008 size_t dv_nseg = 0xffffff21f448b3c0 struct dvmaseg *dv_seg = 0xffffff221c920550 } } } }rootnex_dma_t { boolean_t dp_partial_required = B_FALSE boolean_t dp_trim_required = B_FALSE boolean_t dp_granularity_power_2 = B_TRUE uint64_t dp_maxxfer = 0xfffe00 boolean_t dp_dvma_used = B_FALSE ddi_dma_obj_t dp_dma = { uint_t dmao_size = 0x118000 ddi_dma_atyp_t dmao_type = DMA_OTYP_BUFVADDR ddi_dma_aobj_t dmao_obj = { struct v_address virt_obj = { caddr_t v_addr = 0x8192008 struct as *v_as = 0xffffff21f448b3c0 void *v_priv = 0xffffff221c920550 } struct pp_address pp_obj = { struct page *pp_pp = 0x8192008 uint_t pp_offset = 0xf448b3c0 } struct phy_address phys_obj = { ulong_t p_addr = 0x8192008 ulong_t p_memtype = 0xffffff21f448b3c0 } struct dvma_address dvma_obj = { size_t dv_off = 0x8192008 size_t dv_nseg = 0xffffff21f448b3c0 struct dvmaseg *dv_seg = 0xffffff221c920550 } } } ddi_dma_obj_t dp_dvma = { uint_t dmao_size = 0xbaddcafe ddi_dma_atyp_t dmao_type = -1159869698 ddi_dma_aobj_t dmao_obj = { struct v_address virt_obj = { caddr_t v_addr = 0xbaddcafebaddcafe struct as *v_as = 0xbaddcafebaddcafe void *v_priv = 0xbaddcafebaddcafe } struct pp_address pp_obj = { struct page *pp_pp = 0xbaddcafebaddcafe uint_t pp_offset = 0xbaddcafe } struct phy_address phys_obj = { ulong_t p_addr = 0xbaddcafebaddcafe ulong_t p_memtype = 0xbaddcafebaddcafe } struct dvma_address dvma_obj = { size_t dv_off = 0xbaddcafebaddcafe size_t dv_nseg = 0xbaddcafebaddcafe struct dvmaseg *dv_seg = 0xbaddcafebaddcafe } } } rootnex_sglinfo_t dp_sglinfo = { boolean_t si_cancross = B_TRUE uint64_t si_min_addr = 0 uint64_t si_max_addr = 0xffffffffffffffff uint64_t si_max_cookie_size = 0xfffe00 uint64_t si_segmask = 0xffffffff uint_t si_flags = 0x600 uint_t si_max_pages = 0x119 boolean_t si_bounce_on_seg = B_FALSE size_t si_copybuf_req = 0 off_t si_buf_offset = 0x8 struct as *si_asp = 0xffffff21f448b3c0 uint_t si_sgl_size = 0x119 } size_t dp_copybuf_size = 0 caddr_t dp_cbaddr = 0 size_t dp_cbsize = 0xbaddcafebaddcafe rootnex_pgmap_t *dp_pgmap = 0xbaddcafebaddcafe uint_t dp_current_win = 0xbaddcafe rootnex_window_t *dp_window = 0 boolean_t dp_need_to_free_window = B_FALSE uint_t dp_window_size = 0xbaddcafe uint_t dp_max_win = 0xbaddcafe dev_info_t *dp_dip = 0xffffff21d249e7f8 kmutex_t dp_mutex = { void *[1] _opaque = [ 0 ] } boolean_t dp_inuse = B_FALSE size_t dp_cookie_size = 0x1a58 ddi_dma_cookie_t *dp_cookies = 0xffffff221c62b000 boolean_t dp_need_to_free_cookie = B_TRUE uint_t dp_current_cookie = 0 ddi_dma_cookie_t *dp_saved_cookies = 0 boolean_t dp_need_to_switch_cookies = B_FALSE void *dp_iommu_private = 0xbaddcafebaddcafe uchar_t *dp_prealloc_buffer = 0xffffff221ac78688 int dp_sleep_flags = 0x1 }ddi_dma_attr_t { uint_t dma_attr_version = 0 uint64_t dma_attr_addr_lo = 0 uint64_t dma_attr_addr_hi = 0xffffffffffffffff uint64_t dma_attr_count_max = 0xffffff uint64_t dma_attr_align = 0x4 uint_t dma_attr_burstsizes = 0x78 uint32_t dma_attr_minxfer = 0x1 uint64_t dma_attr_maxxfer = 0xffffff uint64_t dma_attr_seg = 0xffffffff int dma_attr_sgllen = 0x100 uint32_t dma_attr_granular = 0x200 uint_t dma_attr_flags = 0x600 }ddi_dma_obj_t { uint_t dmao_size = 0x118000 ddi_dma_atyp_t dmao_type = DMA_OTYP_BUFVADDR ddi_dma_aobj_t dmao_obj = { struct v_address virt_obj = { caddr_t v_addr = 0x8192008 struct as *v_as = 0xffffff21f448b3c0 void *v_priv = 0xffffff221c920550 } struct pp_address pp_obj = { struct page *pp_pp = 0x8192008 uint_t pp_offset = 0xf448b3c0 } struct phy_address phys_obj = { ulong_t p_addr = 0x8192008 ulong_t p_memtype = 0xffffff21f448b3c0 } struct dvma_address dvma_obj = { size_t dv_off = 0x8192008 size_t dv_nseg = 0xffffff21f448b3c0 struct dvmaseg *dv_seg = 0xffffff221c920550 } } }
It appears that sgl_cnt here is the transfer size divided by the (smallest) page size, plus 1. We have a maximum limit of MPTSAS_MAX_DMA_SEGS, which is 256. This is how we arrive at a maximum DMA I/O size of just under 1MB. There is even a facility to allow the partial DMA; the target driver can set the PKT_DMA_PARTIAL flag. sd does so based on some logic at 7941:
/* * Adjust the maximum transfer size. This is to fix * the problem of partial DMA support on SPARC. Some * HBA driver, like aac, has very small dma_attr_maxxfer * size, which requires partial DMA support on SPARC. * In the future the SPARC pci nexus driver may solve * the problem instead of this fix. */ max_xfer_size = scsi_ifgetcap(SD_ADDRESS(un), "dma-max", 1); if ((max_xfer_size > 0) && (max_xfer_size < un->un_max_xfer_size)) { /* We need DMA partial even on sparc to ensure sddump() works */ un->un_max_xfer_size = max_xfer_size; if (un->un_partial_dma_supported == 0) un->un_partial_dma_supported = 1; } if (ddi_prop_get_int(DDI_DEV_T_ANY, SD_DEVINFO(un), DDI_PROP_DONTPASS, "buf_break", 0) == 1) { if (ddi_xbuf_attr_setup_brk(un->un_xbuf_attr, un->un_max_xfer_size) == 1) { un->un_buf_breakup_supported = 1; SD_INFO(SD_LOG_ATTACH_DETACH, un, "sd_unit_attach: " "un:0x%p Buf breakup enabled\\n", un); } } /* * Set PKT_DMA_PARTIAL flag. */ if (un->un_partial_dma_supported == 1) { un->un_pkt_flags = PKT_DMA_PARTIAL; } else { un->un_pkt_flags = 0; }
It appears that we need mpt_sas to enforce the dma-max capability to be no greater than PAGE_SIZE * MPTSAS_MAX_DMA_SEGS. But at 9472, mptsas.c does:
case SCSI_CAP_DMA_MAX: rval = (int)mpt->m_msg_dma_attr.dma_attr_maxxfer; break;
Which tells sd we can accept up to 24MB without needing partial DMA. Can sd actually help us here? It can if un_max_xfer_size is 1MB or greater; if it isn't, then we need either the following buf_break functionality or sd itself should be failing the write and we should never be getting to the DMA setup code. Is it? It is not! But wait, there's more: sd can still decide that partial DMA support is needed, and sure enough it has:
> *sd_state::walk softstate | ::print struct sd_lun un_max_xfer_size un_partial_dma_supported un_max_xfer_size = 0x10000 un_partial_dma_supported = 0x1 un_max_xfer_size = 0x100000 un_partial_dma_supported = 0x1 un_max_xfer_size = 0x100000 un_partial_dma_supported = 0x1 un_max_xfer_size = 0x100000 un_partial_dma_supported = 0x1 un_max_xfer_size = 0x100000 un_partial_dma_supported = 0x1 un_max_xfer_size = 0x100000 un_partial_dma_supported = 0x1 un_max_xfer_size = 0x100000 un_partial_dma_supported = 0x1 un_max_xfer_size = 0x100000 un_partial_dma_supported = 0x1 un_max_xfer_size = 0x100000 un_partial_dma_supported = 0x1 un_max_xfer_size = 0x100000 un_partial_dma_supported = 0x1 un_max_xfer_size = 0x100000 un_partial_dma_supported = 0x1 un_max_xfer_size = 0x100000 un_partial_dma_supported = 0x1 un_max_xfer_size = 0x100000 un_partial_dma_supported = 0x1
So we have a bunch of devices where sd has correctly surmised that at most 1MB DMA buffers can be used and that partial DMA support is needed. Sure enough, the right flag is set in the sd_lun:
> *sd_state::walk softstate | ::print struct sd_lun un_pkt_flags un_pkt_flags = 0x40000 un_pkt_flags = 0x40000 un_pkt_flags = 0x40000 un_pkt_flags = 0x40000 un_pkt_flags = 0x40000 un_pkt_flags = 0x40000 un_pkt_flags = 0x40000 un_pkt_flags = 0x40000 un_pkt_flags = 0x40000 un_pkt_flags = 0x40000 un_pkt_flags = 0x40000 un_pkt_flags = 0x40000 un_pkt_flags = 0x40000
So why isn't it setting PKT_DMA_PARTIAL? sd_initpkt_for_uscsi tells us the bad news:
/* * Allocate the scsi_pkt for the command. * Note: If PKT_DMA_PARTIAL flag is set, scsi_vhci binds a path * during scsi_init_pkt time and will continue to use the * same path as long as the same scsi_pkt is used without * intervening scsi_dma_free(). Since uscsi command does * not call scsi_dmafree() before retry failed command, it * is necessary to make sure PKT_DMA_PARTIAL flag is NOT * set such that scsi_vhci can use other available path for * retry. Besides, ucsci command does not allow DMA breakup, * so there is no need to set PKT_DMA_PARTIAL flag. */ if (uscmd->uscsi_rqlen > SENSE_LENGTH) { pktp = scsi_init_pkt(SD_ADDRESS(un), NULL, ((bp->b_bcount != 0) ? bp : NULL), uscmd->uscsi_cdblen, ((int)(uscmd->uscsi_rqlen) + sizeof (struct scsi_arq_status) - sizeof (struct scsi_extended_sense)), 0, (un->un_pkt_flags & ~PKT_DMA_PARTIAL) | PKT_XARQ, sdrunout, (caddr_t)un); } else { pktp = scsi_init_pkt(SD_ADDRESS(un), NULL, ((bp->b_bcount != 0) ? bp : NULL), uscmd->uscsi_cdblen, sizeof (struct scsi_arq_status), 0, (un->un_pkt_flags & ~PKT_DMA_PARTIAL), sdrunout, (caddr_t)un); }
It's not clear why this must be so, other than that perhaps this is simply a longstanding bug in the uscsi path. It's likely that we could work around this by forcing this flag set for large transfers. Another option would be to create a new ioctl for WRITE BUFFER that would not use the apparently broken uscsi path.
Updated by Marcel Telka over 3 years ago
- Related to Feature #8235: fwflash for sd needs to handle partial writes added