"ZFS-HSM" Disk-cached writes (all writes, not only sync ones like in ZIL SLOG)
The general idea is simple: buffer writes on a few devices for a long time (maybe hours or days on low-traffic hosts) instead of requiring always-on storage on many HDDs, in order to conserve power, even at the expense of throughput (the user's choice). This may also decrease fragmentation of data into smaller dispersed blocks if the original writes come in small chunks.
This is reminiscent of HSM, but built into ZFS, and continuing the same philosophy -- now merging RAID, ZPL/ZVOL and (parts of) HSM into one software stack :)
The feature is aimed at users who would like to conserve power (e.g. not spin up the 8-disk array for a small browsing cache update, nor require it to keep spinning), such as home-NAS users. For example, a box might include the rpool+ZIL+L2ARC + this write buffer on 2-4 2.5" HDDs or SSDs, and the main storage pool in a series of 5-to-8-disk removable racks with multi-TB disks. Powering the main disks might require tens of watts, while the 2-4 system disks would need a lot less even when working continuously. In this case, when there are only a few bytes to write, they are cached (e.g. on the same mirrored devices as used for the ZIL and/or rpool) until this buffer overflows and/or there is a reason to spin up the main pool vdevs.
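The buffering behaviour described above can be sketched as a toy model. This is only an illustration, not ZFS code: the class name WriteBuffer, its fields, and the main_pool_spinning flag are all hypothetical, and the "main pool" is just a Python list. Sorting the flushed writes by offset stands in for the hoped-for effect of turning many small writes into a few larger ones.

```python
from dataclasses import dataclass, field


@dataclass
class WriteBuffer:
    """Toy model of a small always-on write buffer in front of a spun-down pool."""
    capacity_bytes: int                           # hypothetical size of the buffer devices
    pending: list = field(default_factory=list)   # buffered (offset, data) writes
    used_bytes: int = 0
    flushed: list = field(default_factory=list)   # stands in for the main pool

    def write(self, offset: int, data: bytes, main_pool_spinning: bool = False) -> bool:
        """Buffer one write; return True if this call flushed to the main pool."""
        self.pending.append((offset, data))
        self.used_bytes += len(data)
        # Flush when the buffer is full, or when the main disks are
        # already spinning for some other reason.
        if self.used_bytes >= self.capacity_bytes or main_pool_spinning:
            self.flush()
            return True
        return False

    def flush(self) -> None:
        """Move buffered writes to the main pool, ordered by offset."""
        # A stable sort keeps the arrival order of writes to the same offset.
        self.flushed.extend(sorted(self.pending, key=lambda w: w[0]))
        self.pending.clear()
        self.used_bytes = 0
```

With a 10-byte buffer, two small writes stay buffered and the third one, which fills the buffer, triggers a single ordered flush. A real implementation would of course also have to flush on a timer and survive a reboot.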
Storage hardware requirements/recommendations may be the same as for ZIL SLOGs: require or strongly recommend mirrors. It is questionable whether dedicated "log" devices can or should be used for this feature. Many people have asked on the list, though: if the ZIL can often be sized within 1-4GB, what could they use the rest of even a 20GB SLC drive for? Such a write cache might be one useful answer :)
Enabling the feature (getting data onto main pool HDDs):
- Maybe a per-dataset option is in order (setting whether to buffer writes or stream them to the main pool directly) as a tradeoff for cases where the performance hit is too noticeable. It might be a "yes-no" trigger or a byte size ("how much incoming data is buffered in RAM + write-cached for this dataset?")
- Also, this "to buffer or not" decision may be triggered by last-access times of the leaf vdev hardware and/or by interaction with the power management framework to see whether the pool's HDDs are spun down or not.
- If there is access requiring the main pool VDEVs (e.g. scrubbing, or access to data not cached in L2ARC or on the write-buffer devices), and the main pool is known to have spun up, the write buffer might be automatically flushed onto the main pool.
- One idea is to consider the write-cache (mirror) a special "undetachable" part of the pool, sharing the common TXG numbers, etc. Moving the bytes from this cache to be striped over the other top-level VDEVs might be a special case of BP rewrite.
- Another idea is to work like the ZIL SLOG: when the time comes to sync the write buffer onto the main pool, it is processed like the SLOG journal rollforward, applying transactions to the pool, or perhaps something like "zfs send|zfs recv" could happen between the devices. Increasing TXG numbers can still be used consistently with the main pool's TXG numbering. Likewise, the cache device may get corrupted or lost without breaking consistency of the pool (only losing the cached data). However, there should be a special codepath to read from this write-cache before trying to access the main pool HDDs.
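The SLOG-like variant in the last bullet can be illustrated with a small model. Everything here is an assumption made for the sake of the sketch: the class BufferedPool, its dictionaries, and the method names are hypothetical and do not correspond to any ZFS structure. It shows three properties the bullet asks for: reads consult the write cache before the main pool, the sync step replays cached writes in TXG order, and losing the cache loses only the cached data.

```python
class BufferedPool:
    """Toy model: a write cache consulted before the main pool, replayed in TXG order."""

    def __init__(self):
        self.main = {}        # block id -> data; stands in for the main pool HDDs
        self.cache = {}       # block id -> (txg, data); stands in for the write cache
        self.txg = 0          # one TXG counter shared by the cache and the main pool
        self.main_reads = 0   # counts reads that would need the main disks

    def write(self, block, data):
        """Record a write in the cache only, under the next TXG number."""
        self.txg += 1
        self.cache[block] = (self.txg, data)

    def read(self, block):
        """Serve from the write cache if possible, otherwise from the main pool."""
        if block in self.cache:
            return self.cache[block][1]
        self.main_reads += 1
        return self.main.get(block)

    def sync(self):
        """Apply cached writes to the main pool in TXG order; return how many."""
        entries = sorted(self.cache.items(), key=lambda kv: kv[1][0])
        for block, (_txg, data) in entries:
            self.main[block] = data
        self.cache.clear()
        return len(entries)

    def lose_cache(self):
        """Simulate a lost cache device: unsynced writes vanish, the pool stays valid."""
        self.cache.clear()
```

In this model a block rewritten after the last sync reads back its new contents without touching the main disks, and if the cache is then lost, the same read falls back to the older, still consistent copy on the main pool.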
Further development of the HSM idea is out of this RFE's scope, but it might for example dedicate parts of the common pool located on rarely-spun physical disks grouped as top-level VDEVs to write-once-read-almost-never data such as backups, etc., while frequently-used data is located on other top-level VDEVs (instead of striping everything over all TLVDEVs). That might require BP rewrite to relocate existing rarely-accessed blocks into such rarely-spun VDEVs and back, if access frequency increases. Possibly these TLVDEVs can be made on differently performing disks (like in HSM systems).
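As a rough illustration of that follow-up idea only, relocation by access frequency could be planned as below. The tier labels "active" and "archive", the reads-per-day metric, and the hot_threshold parameter are all hypothetical; the source does not specify how access frequency would be measured or how blocks would actually be moved (that is the part that would need BP rewrite).

```python
def plan_moves(blocks, hot_threshold=1.0):
    """Return (block_id, from_tier, to_tier) for blocks sitting on the wrong tier.

    blocks maps block_id -> (current_tier, reads_per_day); the tiers are the
    hypothetical labels "active" (always-on VDEVs) and "archive"
    (rarely-spun VDEVs).
    """
    moves = []
    for block_id, (tier, reads_per_day) in sorted(blocks.items()):
        # Frequently read data belongs on the always-on tier, the rest on
        # the rarely-spun tier.
        wanted = "active" if reads_per_day >= hot_threshold else "archive"
        if wanted != tier:
            moves.append((block_id, tier, wanted))
    return moves
```

A never-read backup on the active tier would be scheduled for the archive tier, and an archived block that starts being read several times a day would be scheduled to move back.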
While this is currently aimed at HDDs, this write-buffering might prove useful in the future when SSDs dominate the planet: SLC SSDs or DRAM "disks" could still be used for the write cache, while cheaper-per-gigabyte MLC SSDs (which wear out faster under writes) could receive the write-cached bulk data in a few large writes, reducing the impact of remappings, etc.