Project

General

Profile

Bug #3425

first write to a new zvol can fail with EFBIG

Added by Robert Mustacchi almost 8 years ago. Updated almost 8 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
zfs - Zettabyte File System
Start date:
2012-12-20
Due date:
% Done:

100%

Estimated time:
Difficulty:
Hard
Tags:
needs-triage
Gerrit CR:

Description

When creating a new zvol, we have seen that the very first write to it can fail with EFBIG. When writing to a zvol, zfs goes through and chunks it up in zvol read to be at most 5 MB (DMU_MAX_ACCESS >> 1). This can be trivially reproduced with something like:

1. zfs create -V 30G zones/fatty
2. /usr/bin/dd if=/dev/urandom of=/dev/zvol/rdsk/zones/fatty bs=$((5 * 1024 * 1024))

If we DTrace the zfs bits in question here, this is working exactly as designed. Once we go through this bit of code we have the following call order:

zvol_write->dmu_tx_hold_write->dmu_tx_count_write which is where we are getting the EFBIG from. The code in dmu_tx_count_write has the following snippet:

380         if (txh->txh_space_towrite + txh->txh_space_tooverwrite >
 381             2 * DMU_MAX_ACCESS)
 382                 err = EFBIG;
 383 
 384         if (err)
 385                 txh->txh_tx->tx_err = err;

If we are writing more than 20 MB then we are going to flag this error for this transaction. Using a bit of DTrace we can see what the actual values for the txh are:

[root@headnode (bh1-kvm2:0) ~]# dtrace -n 'fbt::dmu_tx_count_write:entry/execname=="dd"/{ trace(arg1); trace(arg2); self->txh = arg0; print(*(dmu_tx_hold_t *)arg0); }' -n 'fbt::dmu_tx_count_write:return/self->txh/{ print(*(dmu_tx_hold_t *)self->txh); self->txh = 0; exit(0); }'
dtrace: description 'fbt::dmu_tx_count_write:entry' matched 1 probe
dtrace: description 'fbt::dmu_tx_count_write:return' matched 1 probe
CPU     ID                    FUNCTION:NAME
  7  47811         dmu_tx_count_write:entry                 0          5242880
dmu_tx_hold_t {
    dmu_tx_t *txh_tx = 0xffffff0444d04c98
    list_node_t txh_node = {
        struct list_node *list_next = 0xffffff0444d04ca8
        struct list_node *list_prev = 0xffffff0444d04ca8
    }
    struct dnode *txh_dnode = 0xffffff0307952a30
    uint64_t txh_space_towrite = 0
    uint64_t txh_space_tofree = 0
    uint64_t txh_space_tooverwrite = 0
    uint64_t txh_space_tounref = 0
    uint64_t txh_memory_tohold = 0
    uint64_t txh_fudge = 0
}
  7  47812        dmu_tx_count_write:return dmu_tx_hold_t {
    dmu_tx_t *txh_tx = 0xffffff0444d04c98
    list_node_t txh_node = {
        struct list_node *list_next = 0xffffff0444d04ca8
        struct list_node *list_prev = 0xffffff0444d04ca8
    }
    struct dnode *txh_dnode = 0xffffff0307952a30
    uint64_t txh_space_towrite = 0x1c1a000
    uint64_t txh_space_tofree = 0
    uint64_t txh_space_tooverwrite = 0
    uint64_t txh_space_tounref = 0
    uint64_t txh_memory_tohold = 0
    uint64_t txh_fudge = 0
}

From here we can see that the value of txh_space_towrite is indeed greater than 20 MB. If you look at the difference in terms of space requested, we're writing an extra 20+ MB for our original 5 MB request. If we do this with a modified DTrace script and vary the block size we can see that there are varying levels of extra metadata that we need to write for this first write.

dtrace -n 'fbt::dmu_tx_count_write:entry/execname=="dd"/{ trace(arg1); trace(arg2); self->txh = arg0; print(*(dmu_tx_hold_t *)arg0); }' -n 'fbt::dmu_tx_count_write:return/self->txh/{ print(*(dmu_tx_hold_t *)self->txh); self->txh = 0; i++; }' -n 'fbt::dmu_tx_count_write:return/i == 2/{ exit(0); }'

With a 1 byte dd block size:

CPU     ID                    FUNCTION:NAME
  2  47811         dmu_tx_count_write:entry                 0           131072dmu_tx_hold_t {
    dmu_tx_t *txh_tx = 0xffffff02fef8d968
    list_node_t txh_node = {
        struct list_node *list_next = 0xffffff02fef8d978
        struct list_node *list_prev = 0xffffff02fef8d978
    }
    struct dnode *txh_dnode = 0xffffff03078aa740
    uint64_t txh_space_towrite = 0
    uint64_t txh_space_tofree = 0
    uint64_t txh_space_tooverwrite = 0
    uint64_t txh_space_tounref = 0
    uint64_t txh_memory_tohold = 0
    uint64_t txh_fudge = 0
}
  2  47812        dmu_tx_count_write:return dmu_tx_hold_t {
    dmu_tx_t *txh_tx = 0xffffff02fef8d968
    list_node_t txh_node = {
        struct list_node *list_next = 0xffffff02fef8d978
        struct list_node *list_prev = 0xffffff02fef8d978
    }
    struct dnode *txh_dnode = 0xffffff03078aa740
    uint64_t txh_space_towrite = 0xf6000
    uint64_t txh_space_tofree = 0
    uint64_t txh_space_tooverwrite = 0
    uint64_t txh_space_tounref = 0
    uint64_t txh_memory_tohold = 0
    uint64_t txh_fudge = 0
}
  2  47811         dmu_tx_count_write:entry            131072           131072dmu_tx_hold_t {
    dmu_tx_t *txh_tx = 0xffffff02fef8d968
    list_node_t txh_node = {
        struct list_node *list_next = 0xffffff02fef8d978
        struct list_node *list_prev = 0xffffff02fef8d978
    }
    struct dnode *txh_dnode = 0xffffff03078aa740
    uint64_t txh_space_towrite = 0
    uint64_t txh_space_tofree = 0
    uint64_t txh_space_tooverwrite = 0
    uint64_t txh_space_tounref = 0
    uint64_t txh_memory_tohold = 0
    uint64_t txh_fudge = 0
}
  2  47812        dmu_tx_count_write:return dmu_tx_hold_t {
    dmu_tx_t *txh_tx = 0xffffff02fef8d968
    list_node_t txh_node = {
        struct list_node *list_next = 0xffffff02fef8d978
        struct list_node *list_prev = 0xffffff02fef8d978
    }
    struct dnode *txh_dnode = 0xffffff03078aa740
    uint64_t txh_space_towrite = 0x40000
    uint64_t txh_space_tofree = 0
    uint64_t txh_space_tooverwrite = 0
    uint64_t txh_space_tounref = 0
    uint64_t txh_memory_tohold = 0
    uint64_t txh_fudge = 0
}
  2  47812        dmu_tx_count_write:return

With a 128k block size:

/usr/bin/dd if=/dev/urandom of=/dev/zvol/rdsk/zones/fatty bs=128k
CPU     ID                    FUNCTION:NAME
  2  47811         dmu_tx_count_write:entry                 0           131072dmu_tx_hold_t {
    dmu_tx_t *txh_tx = 0xffffff02fef8d968
    list_node_t txh_node = {
        struct list_node *list_next = 0xffffff02fef8d978
        struct list_node *list_prev = 0xffffff02fef8d978
    }
    struct dnode *txh_dnode = 0xffffff03078aa740
    uint64_t txh_space_towrite = 0
    uint64_t txh_space_tofree = 0
    uint64_t txh_space_tooverwrite = 0
    uint64_t txh_space_tounref = 0
    uint64_t txh_memory_tohold = 0
    uint64_t txh_fudge = 0
}
  2  47812        dmu_tx_count_write:return dmu_tx_hold_t {
    dmu_tx_t *txh_tx = 0xffffff02fef8d968
    list_node_t txh_node = {
        struct list_node *list_next = 0xffffff02fef8d978
        struct list_node *list_prev = 0xffffff02fef8d978
    }
    struct dnode *txh_dnode = 0xffffff03078aa740
    uint64_t txh_space_towrite = 0xf6000
    uint64_t txh_space_tofree = 0
    uint64_t txh_space_tooverwrite = 0
    uint64_t txh_space_tounref = 0
    uint64_t txh_memory_tohold = 0
    uint64_t txh_fudge = 0
}
  2  47811         dmu_tx_count_write:entry            131072           131072dmu_tx_hold_t {
    dmu_tx_t *txh_tx = 0xffffff02fef8d968
    list_node_t txh_node = {
        struct list_node *list_next = 0xffffff02fef8d978
        struct list_node *list_prev = 0xffffff02fef8d978
    }
    struct dnode *txh_dnode = 0xffffff03078aa740
    uint64_t txh_space_towrite = 0
    uint64_t txh_space_tofree = 0
    uint64_t txh_space_tooverwrite = 0
    uint64_t txh_space_tounref = 0
    uint64_t txh_memory_tohold = 0
    uint64_t txh_fudge = 0
}
  2  47812        dmu_tx_count_write:return dmu_tx_hold_t {
    dmu_tx_t *txh_tx = 0xffffff02fef8d968
    list_node_t txh_node = {
        struct list_node *list_next = 0xffffff02fef8d978
        struct list_node *list_prev = 0xffffff02fef8d978
    }
    struct dnode *txh_dnode = 0xffffff03078aa740
    uint64_t txh_space_towrite = 0x40000
    uint64_t txh_space_tofree = 0
    uint64_t txh_space_tooverwrite = 0
    uint64_t txh_space_tounref = 0
    uint64_t txh_memory_tohold = 0
    uint64_t txh_fudge = 0
}
  2  47812        dmu_tx_count_write:return

From looking at the code there is always going to some additional growth due to indirect blocks and the like. What isn't clear is why if you do a first write with say a BS=1M and then come back and switch to the 5M that it works just fine.

#1

Updated by George Wilson almost 8 years ago

  • Status changed from New to In Progress
  • Assignee set to George Wilson

We are over estimating the usage because we're using the wrong indirect block size. I have a fix that will be sent out for review shortly.

#2

Updated by Christopher Siden almost 8 years ago

  • Status changed from In Progress to Closed
  • % Done changed from 0 to 100
commit bda8819
Author: George Wilson <george.wilson@delphix.com>
Date:   Wed Jan 30 12:01:51 2013

    3422 zpool create/syseventd race yield non-importable pool
    3425 first write to a new zvol can fail with EFBIG
    Reviewed by: Adam Leventhal <ahl@delphix.com>
    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Approved by: Garrett D'Amore <garrett@damore.org>

Also available in: Atom PDF