Bug #3425
first write to a new zvol can fail with EFBIG

Status: Closed
% Done: 100%
Description
When creating a new zvol, we have seen that the very first write to it can fail with EFBIG. When writing to a zvol, zfs goes through and chunks the request up in zvol_write to be at most 5 MB (DMU_MAX_ACCESS >> 1). This can be trivially reproduced with something like:
1. zfs create -V 30G zones/fatty
2. /usr/bin/dd if=/dev/urandom of=/dev/zvol/rdsk/zones/fatty bs=$((5 * 1024 * 1024))
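To make the chunking just described concrete, here is a minimal sketch of that loop. This is not the actual illumos zvol_write(), only the shape of it; DMU_MAX_ACCESS is taken to be 10 MB, which is what the 5 MB chunk and the 20 MB cutoff discussed below imply.

#include <stdio.h>
#include <stdint.h>

/*
 * Sketch only -- not the illumos zvol_write().  DMU_MAX_ACCESS is taken
 * to be 10 MB here (the 5 MB chunk above is DMU_MAX_ACCESS >> 1).
 */
#define	DMU_MAX_ACCESS	(10ULL << 20)

static void
chunked_write(uint64_t off, uint64_t resid, uint64_t volsize)
{
	while (resid != 0 && off < volsize) {
		/* Hand the DMU at most DMU_MAX_ACCESS >> 1 at a time. */
		uint64_t bytes = resid;
		if (bytes > (DMU_MAX_ACCESS >> 1))
			bytes = DMU_MAX_ACCESS >> 1;

		printf("chunk: off=%llu bytes=%llu\n",
		    (unsigned long long)off, (unsigned long long)bytes);

		off += bytes;
		resid -= bytes;
	}
}

int
main(void)
{
	/* A single 5 MB dd write fits in exactly one chunk. */
	chunked_write(0, 5ULL << 20, 30ULL << 30);
	return (0);
}

So a 5 MB dd write turns into exactly one DMU transaction hold, which is the case that fails.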
If we DTrace the zfs bits in question here, this is working exactly as designed. Once we go through this bit of code we have the following call order:
zvol_write -> dmu_tx_hold_write -> dmu_tx_count_write, which is where we are getting the EFBIG from. The code in dmu_tx_count_write has the following snippet:
	if (txh->txh_space_towrite + txh->txh_space_tooverwrite >
	    2 * DMU_MAX_ACCESS)
		err = EFBIG;

	if (err)
		txh->txh_tx->tx_err = err;
If we are writing more than 20 MB then we are going to flag this error for this transaction. Using a bit of DTrace we can see what the actual values for the txh are:
[root@headnode (bh1-kvm2:0) ~]# dtrace -n 'fbt::dmu_tx_count_write:entry/execname=="dd"/{ trace(arg1); trace(arg2); self->txh = arg0; print(*(dmu_tx_hold_t *)arg0); }' -n 'fbt::dmu_tx_count_write:return/self->txh/{ print(*(dmu_tx_hold_t *)self->txh); self->txh = 0; exit(0); }'
dtrace: description 'fbt::dmu_tx_count_write:entry' matched 1 probe
dtrace: description 'fbt::dmu_tx_count_write:return' matched 1 probe
CPU     ID                    FUNCTION:NAME
  7  47811         dmu_tx_count_write:entry                 0          5242880
dmu_tx_hold_t {
    dmu_tx_t *txh_tx = 0xffffff0444d04c98
    list_node_t txh_node = {
        struct list_node *list_next = 0xffffff0444d04ca8
        struct list_node *list_prev = 0xffffff0444d04ca8
    }
    struct dnode *txh_dnode = 0xffffff0307952a30
    uint64_t txh_space_towrite = 0
    uint64_t txh_space_tofree = 0
    uint64_t txh_space_tooverwrite = 0
    uint64_t txh_space_tounref = 0
    uint64_t txh_memory_tohold = 0
    uint64_t txh_fudge = 0
}
  7  47812        dmu_tx_count_write:return
dmu_tx_hold_t {
    dmu_tx_t *txh_tx = 0xffffff0444d04c98
    list_node_t txh_node = {
        struct list_node *list_next = 0xffffff0444d04ca8
        struct list_node *list_prev = 0xffffff0444d04ca8
    }
    struct dnode *txh_dnode = 0xffffff0307952a30
    uint64_t txh_space_towrite = 0x1c1a000
    uint64_t txh_space_tofree = 0
    uint64_t txh_space_tooverwrite = 0
    uint64_t txh_space_tounref = 0
    uint64_t txh_memory_tohold = 0
    uint64_t txh_fudge = 0
}
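For scale, a quick conversion of that return value makes the comparison explicit. This is just arithmetic on the numbers shown above, not ZFS code; DMU_MAX_ACCESS is taken to be 10 MB, as implied by the 5 MB chunk and the 20 MB cutoff.

#include <stdio.h>

/* Back-of-the-envelope check of the DTrace output above. */
int
main(void)
{
	unsigned long long dmu_max_access = 10ULL << 20;	/* assumed 10 MiB */
	unsigned long long threshold = 2 * dmu_max_access;	/* EFBIG cutoff */
	unsigned long long towrite = 0x1c1a000ULL;		/* from the return probe */
	unsigned long long payload = 5ULL << 20;		/* the dd write itself */

	printf("txh_space_towrite = %llu bytes (%.1f MiB)\n",
	    towrite, towrite / 1048576.0);
	printf("EFBIG threshold   = %llu bytes (%.1f MiB)\n",
	    threshold, threshold / 1048576.0);
	printf("metadata estimate = %.1f MiB on top of a %.1f MiB write\n",
	    (towrite - payload) / 1048576.0, payload / 1048576.0);
	return (0);
}

This prints roughly 28.1 MiB charged against a 20 MiB threshold, i.e. about 23 MiB of estimated metadata for a single 5 MiB write.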
From here we can see that the value of txh_space_towrite is indeed greater than 20 MB: in terms of space requested, we're being charged an extra 20+ MB of metadata on top of our original 5 MB write. If we do this with a modified DTrace script and vary the block size, we can see that differing amounts of extra metadata are charged for this first write:
dtrace -n 'fbt::dmu_tx_count_write:entry/execname=="dd"/{ trace(arg1); trace(arg2); self->txh = arg0; print(*(dmu_tx_hold_t *)arg0); }' -n 'fbt::dmu_tx_count_write:return/self->txh/{ print(*(dmu_tx_hold_t *)self->txh); self->txh = 0; i++; }' -n 'fbt::dmu_tx_count_write:return/i == 2/{ exit(0); }'
With a 1 byte dd block size:
CPU     ID                    FUNCTION:NAME
  2  47811         dmu_tx_count_write:entry                 0           131072
dmu_tx_hold_t {
    dmu_tx_t *txh_tx = 0xffffff02fef8d968
    list_node_t txh_node = {
        struct list_node *list_next = 0xffffff02fef8d978
        struct list_node *list_prev = 0xffffff02fef8d978
    }
    struct dnode *txh_dnode = 0xffffff03078aa740
    uint64_t txh_space_towrite = 0
    uint64_t txh_space_tofree = 0
    uint64_t txh_space_tooverwrite = 0
    uint64_t txh_space_tounref = 0
    uint64_t txh_memory_tohold = 0
    uint64_t txh_fudge = 0
}
  2  47812        dmu_tx_count_write:return
dmu_tx_hold_t {
    dmu_tx_t *txh_tx = 0xffffff02fef8d968
    list_node_t txh_node = {
        struct list_node *list_next = 0xffffff02fef8d978
        struct list_node *list_prev = 0xffffff02fef8d978
    }
    struct dnode *txh_dnode = 0xffffff03078aa740
    uint64_t txh_space_towrite = 0xf6000
    uint64_t txh_space_tofree = 0
    uint64_t txh_space_tooverwrite = 0
    uint64_t txh_space_tounref = 0
    uint64_t txh_memory_tohold = 0
    uint64_t txh_fudge = 0
}
  2  47811         dmu_tx_count_write:entry            131072           131072
dmu_tx_hold_t {
    dmu_tx_t *txh_tx = 0xffffff02fef8d968
    list_node_t txh_node = {
        struct list_node *list_next = 0xffffff02fef8d978
        struct list_node *list_prev = 0xffffff02fef8d978
    }
    struct dnode *txh_dnode = 0xffffff03078aa740
    uint64_t txh_space_towrite = 0
    uint64_t txh_space_tofree = 0
    uint64_t txh_space_tooverwrite = 0
    uint64_t txh_space_tounref = 0
    uint64_t txh_memory_tohold = 0
    uint64_t txh_fudge = 0
}
  2  47812        dmu_tx_count_write:return
dmu_tx_hold_t {
    dmu_tx_t *txh_tx = 0xffffff02fef8d968
    list_node_t txh_node = {
        struct list_node *list_next = 0xffffff02fef8d978
        struct list_node *list_prev = 0xffffff02fef8d978
    }
    struct dnode *txh_dnode = 0xffffff03078aa740
    uint64_t txh_space_towrite = 0x40000
    uint64_t txh_space_tofree = 0
    uint64_t txh_space_tooverwrite = 0
    uint64_t txh_space_tounref = 0
    uint64_t txh_memory_tohold = 0
    uint64_t txh_fudge = 0
}
  2  47812        dmu_tx_count_write:return
With a 128k block size:
/usr/bin/dd if=/dev/urandom of=/dev/zvol/rdsk/zones/fatty bs=128k
CPU     ID                    FUNCTION:NAME
  2  47811         dmu_tx_count_write:entry                 0           131072
dmu_tx_hold_t {
    dmu_tx_t *txh_tx = 0xffffff02fef8d968
    list_node_t txh_node = {
        struct list_node *list_next = 0xffffff02fef8d978
        struct list_node *list_prev = 0xffffff02fef8d978
    }
    struct dnode *txh_dnode = 0xffffff03078aa740
    uint64_t txh_space_towrite = 0
    uint64_t txh_space_tofree = 0
    uint64_t txh_space_tooverwrite = 0
    uint64_t txh_space_tounref = 0
    uint64_t txh_memory_tohold = 0
    uint64_t txh_fudge = 0
}
  2  47812        dmu_tx_count_write:return
dmu_tx_hold_t {
    dmu_tx_t *txh_tx = 0xffffff02fef8d968
    list_node_t txh_node = {
        struct list_node *list_next = 0xffffff02fef8d978
        struct list_node *list_prev = 0xffffff02fef8d978
    }
    struct dnode *txh_dnode = 0xffffff03078aa740
    uint64_t txh_space_towrite = 0xf6000
    uint64_t txh_space_tofree = 0
    uint64_t txh_space_tooverwrite = 0
    uint64_t txh_space_tounref = 0
    uint64_t txh_memory_tohold = 0
    uint64_t txh_fudge = 0
}
  2  47811         dmu_tx_count_write:entry            131072           131072
dmu_tx_hold_t {
    dmu_tx_t *txh_tx = 0xffffff02fef8d968
    list_node_t txh_node = {
        struct list_node *list_next = 0xffffff02fef8d978
        struct list_node *list_prev = 0xffffff02fef8d978
    }
    struct dnode *txh_dnode = 0xffffff03078aa740
    uint64_t txh_space_towrite = 0
    uint64_t txh_space_tofree = 0
    uint64_t txh_space_tooverwrite = 0
    uint64_t txh_space_tounref = 0
    uint64_t txh_memory_tohold = 0
    uint64_t txh_fudge = 0
}
  2  47812        dmu_tx_count_write:return
dmu_tx_hold_t {
    dmu_tx_t *txh_tx = 0xffffff02fef8d968
    list_node_t txh_node = {
        struct list_node *list_next = 0xffffff02fef8d978
        struct list_node *list_prev = 0xffffff02fef8d978
    }
    struct dnode *txh_dnode = 0xffffff03078aa740
    uint64_t txh_space_towrite = 0x40000
    uint64_t txh_space_tofree = 0
    uint64_t txh_space_tooverwrite = 0
    uint64_t txh_space_tounref = 0
    uint64_t txh_memory_tohold = 0
    uint64_t txh_fudge = 0
}
  2  47812        dmu_tx_count_write:return
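Reading the two returns above: the first 128 KiB write is charged 0xf6000 bytes of txh_space_towrite, while the second, identical write at the next offset is charged only 0x40000 bytes, so most of the inflation appears to land on the very first write to the object. A quick conversion of those two values (plain arithmetic, nothing more):

#include <stdio.h>

/* Convert the two txh_space_towrite values reported above into KiB. */
int
main(void)
{
	unsigned int first = 0xf6000;	/* first 128 KiB write */
	unsigned int second = 0x40000;	/* second 128 KiB write */

	printf("first write:  %u bytes (%u KiB)\n", first, first / 1024);
	printf("second write: %u bytes (%u KiB)\n", second, second / 1024);
	return (0);
}

That works out to roughly 984 KiB charged for the first 128 KiB write versus 256 KiB for the second.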
From looking at the code, there is always going to be some additional growth due to indirect blocks and the like. What isn't clear is why, if you do a first write with, say, bs=1M and then come back and switch to the 5M block size, it works just fine.
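One way to see how much the assumed indirect block size can move an estimate like this: the number of indirect blocks a write can touch depends on how many 128-byte block pointers fit in one indirect block. The sketch below is only a rough model, not the dmu_tx_count_write() accounting, and the 4 KiB / 128 KiB indirect block sizes are illustrative assumptions rather than the constants ZFS uses; the data block size is the default 8 KiB zvol volblocksize.

#include <stdio.h>

/*
 * Rough model only -- not the actual dmu_tx_count_write() accounting.
 * Assumptions: 8 KiB data blocks and 128-byte block pointers; the two
 * indirect block sizes compared below are example values.
 */
#define	BLKPTR_SIZE	128ULL

static unsigned long long
level1_indirects(unsigned long long write_bytes, unsigned long long datablk,
    unsigned long long indblk)
{
	unsigned long long nblks = (write_bytes + datablk - 1) / datablk;
	unsigned long long ptrs = indblk / BLKPTR_SIZE;

	return ((nblks + ptrs - 1) / ptrs);
}

int
main(void)
{
	unsigned long long write_bytes = 5ULL << 20;	/* one 5 MB chunk */
	unsigned long long datablk = 8ULL << 10;	/* 8 KiB volblocksize */

	printf("4 KiB indirect blocks:   %llu level-1 indirects\n",
	    level1_indirects(write_bytes, datablk, 4ULL << 10));
	printf("128 KiB indirect blocks: %llu level-1 indirects\n",
	    level1_indirects(write_bytes, datablk, 128ULL << 10));
	return (0);
}

The same 640 data blocks need 20 level-1 indirects under the small assumed indirect block size but only 1 under the larger one. If the estimator has to assume a pessimistic indirect block size before the dnode's real one is established, but can use the real value once data exists, that would also explain why an initial write at a smaller block size lets the later 5M write succeed; it lines up with the wrong-indirect-block-size diagnosis in the update below.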
Updated by George Wilson over 10 years ago
- Status changed from New to In Progress
- Assignee set to George Wilson
We are overestimating the usage because we're using the wrong indirect block size. I have a fix that will be sent out for review shortly.
Updated by Christopher Siden over 10 years ago
- Status changed from In Progress to Closed
- % Done changed from 0 to 100
commit bda8819
Author: George Wilson <george.wilson@delphix.com>
Date:   Wed Jan 30 12:01:51 2013

    3422 zpool create/syseventd race yield non-importable pool
    3425 first write to a new zvol can fail with EFBIG
    Reviewed by: Adam Leventhal <ahl@delphix.com>
    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Approved by: Garrett D'Amore <garrett@damore.org>