Feature #2019

ZFS Injection of "checkpoints" (ex-"snapshots") to facilitate incremental zfs send in smaller chunks

Added by Jim Klimov over 8 years ago. Updated about 8 years ago.

zfs - Zettabyte File System

Over the past months I proposed some ideas on the zfs-discuss list, which received some praise and some constructive criticism. I'd like to summarize them here to gain the attention of developers who would like to explore this path and/or tick the box when "it is done" :)
In particular, Matt Ahrens wrote [2] that Delphix are working on a similar feature.

The basic idea was to add restartability to the "ZFS Send/ZFS Recv" mechanism in order to work around unreliable connections, which can make it impossible to "atomically" send large sets of data (say, a 1 TB initial replication over a GPRS modem, to be extreme). This is especially important for initial streams, because the idea to zfs-replicate a dataset may come after it has been in the field for some time, possibly without the auto-zfs-snapshot service in place, so it contains lots of data by the time you create its first snapshot.
Other users' practice has also shown that rolling back a partially received huge snapshot can bring Solaris to its knees in the form of hangs and/or kernel panics, resulting in days of downtime, after which the zfs-send has to be retried, if desired.
My original proposal was to provide a way to inject snapshots or similar constructs (generally, "checkpoints") to keep track of progress during a zfs send, so that if the send is aborted, both sender and receiver can restart it from the same point, like any incremental-snapshot send would. Alternatively, such "checkpoints" could be preassigned by the user based on his knowledge of his networking and other conditions, e.g. every 100 MB of the data stream.
The key here is the ability to inject such "checkpoints" into preexisting data on the disk, on both sides of the communication (rationale below), at exactly predefined locations/data offsets, so that equivalent ranges of data are marked by such checkpoints. A TXG number might be used for such a definition, but that is likely to differ between systems (and the TXG boundaries separating the data offsets might differ as well?). There is also the matter of changing block sizes (e.g. many small blocks sent may be received and coalesced as one big block), so it may take some iterations to find a "common" block boundary.

In the mail discussions, Matt Ahrens validly pointed out [5] that such injected constructs ("checkpoints") cannot be considered true ZFS snapshots, because there is no associated history of data changes and deletions. Preexisting on-disk data, be it a live dataset or a snapshot, already aggregates its history since the previous snapshot (or dataset creation time) up to the pool's current TXG number for a live dataset (or the snapshot's birth TXG number for a snapshot) and discards intermediate short-lived changes. Thus an injected "checkpoint", if viewed as a snapshot, would only include additions of data blocks with TXG numbers larger than the checkpoint's "creation TXG". In particular, these constructs should not become bases for clones, etc. like normal snapshots would (as I originally proposed in email discussions).
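
A toy illustration of this limitation, again in Python and under loud assumptions: the `Block` type, the `blocks_in_increment` function and the sample TXG numbers are all hypothetical and say nothing about the real on-disk format. The only thing the sketch models is that each surviving block carries a birth TXG, so checkpoints placed at chosen TXGs can do no more than partition the surviving blocks into ranges.

```python
from typing import NamedTuple


class Block(NamedTuple):
    """Hypothetical stand-in for a data block that still exists on disk."""
    name: str
    birth_txg: int


def blocks_in_increment(blocks, from_txg, to_txg):
    # Blocks born after the earlier checkpoint, up to and including the later one.
    return [b for b in blocks if from_txg < b.birth_txg <= to_txg]


# Only blocks that survived until now are visible; anything written and
# freed earlier left no trace, so no checkpoint can "contain" it.
surviving = [Block("a", 10), Block("b", 25), Block("c", 40), Block("d", 55)]
checkpoints = [0, 20, 40, 60]   # hypothetical checkpoint TXGs

increments = [
    [b.name for b in blocks_in_increment(surviving, lo, hi)]
    for lo, hi in zip(checkpoints, checkpoints[1:])
]
print(increments)   # [['a'], ['b', 'c'], ['d']]
```

Every surviving block falls into exactly one increment, which is enough for sending in smaller pieces, but none of the increments records a deletion, which is why such a construct is not a true snapshot.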

Still, I believe that for the purpose of ZFS-send these "checkpoints" can serve a similar role to snapshots in order to enable incremental send of smaller data chunks which can atomically succeed or fail - with failure having a smaller impact on both the overall process of sending the data with retries, and on the receiving systems' cleanup chores (and possible downtime or performance degradation while rolling back the failed chunk).

Other use cases for injected "checkpoints" can include recovery from corrupt on-disk data on one of the systems which replicate via zfs-send (more on that in a later RFE for replacement of intermediate snapshots, or in the use cases email [4]).

Other use cases for injected fully-fledged snapshots can include creation (or fixation) of rollback TXGs, if deferred reuse of freed blocks is implemented (see RFE #2014). This would allow admins to recover from pool-wide mistakes, including dataset deletion, made within roughly the last 10 minutes (128*5 sec = 640 sec). Basically, that would rely on the on-disk existence of some previous uberblocks (available in the ring) and of all metadata and data blocks referenced from those past uberblocks.
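
The 10-minute figure is simple arithmetic, spelled out below as a small Python sketch. The two constants are the figures quoted above; the function name is hypothetical, and the calculation assumes one uberblock is written per TXG at a fixed interval, so the result is an estimate rather than a guarantee.

```python
UBERBLOCK_RING_SLOTS = 128   # figure quoted in the text
TXG_INTERVAL_SEC = 5         # figure quoted in the text


def rollback_window_seconds(slots: int = UBERBLOCK_RING_SLOTS,
                            txg_interval_sec: int = TXG_INTERVAL_SEC) -> int:
    # Assumes exactly one uberblock per TXG and a constant TXG interval.
    return slots * txg_interval_sec


window = rollback_window_seconds()
minutes, seconds = divmod(window, 60)
print(window, minutes, seconds)   # 640 10 40
```

If TXGs are synced more often than the assumed interval, the ring wraps sooner and the usable window shrinks accordingly.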

Links: discussion threads
[1] - Original discussion about restartable ZFS send "(Incremental) ZFS SEND at sub-snapshot level"
[3] - newer proposal about injection of snapshots "Injection of ZFS snapshots into existing data, and replacement of older snapshots with zfs recv without truncating newer ones"
[4] - some use cases to support the proposal



Updated by Matthew Ahrens about 8 years ago

This bug should be closed as a duplicate of 2605, but I can't see how to do that through this interface.


Updated by Garrett D'Amore about 8 years ago

  • Status changed from New to Closed

Closed. Dup.
