Bug #537

time-slider is snapshotting swap and dump after upgrading to oi_147

Added by Jeremy Thornhill almost 9 years ago. Updated about 8 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
Desktop (JDS)
Target version:
Start date:
2010-12-17
Due date:
2011-09-14
% Done:

0%

Estimated time:
40.00 h
Difficulty:
Expert
Tags:
time-slider

Description

As background, this is an x86 system that's been upgraded from OpenSolaris snv_133, and the time-slider configuration I'm using is unmodified post-upgrade (save for disabling the legacy time-slider-cleanup cron job).

On this system rpool/swap and rpool/dump - which are both configured with 'com.sun:auto-snapshot = false' - are being snapshotted at every possible interval (e.g. hourly, frequent, daily). Additionally, it seems that time-slider never cleans up these snapshots, even when they have a "USED" value of 0. This results in hundreds of empty snapshots, which I've had to remove manually.

Auto snapshots for other filesystems seem to be both created and destroyed as expected, and other filesystems with 'com.sun:auto-snapshot = false' are not being snapshotted at all (as expected).

All zfs pools and filesystems have been upgraded to the latest version. I've tried restarting the time-slider services, but that seems to have no effect.

History

#1

Updated by Jeremy Thornhill almost 9 years ago

Update:

I realized that this is also happening with my old BEs (e.g. rpool/ROOT/opensolaris-x) and indeed every child filesystem of rpool.

I now believe that time-slider is erroneously creating recursive snapshots of rpool, even when some child filesystems are supposed to be excluded based on the zfs property. I tested this by setting com.sun:auto-snapshot to false on rpool, after which no new snapshots were created on these filesystems. When I re-enabled com.sun:auto-snapshot on rpool, snapshots were created for every child, regardless of whether com.sun:auto-snapshot is false on the individual child filesystems.

My guess (I'll try to verify this in the python code): during creation, time-slider is doing a zfs snapshot -r rpool instead of individually validating each child filesystem. However, during the destroy operation, it's (properly) checking for com.sun:auto-snapshot and not destroying any snapshots on those. The net result is that such filesystems will see an indefinitely increasing number of snapshots created and never destroyed.

Oddly, this is only happening on 'rpool' (a single IDE pool which holds my OS) and not on 'vault' (a raidz pool).

I suspect this is a bug in zfs.Datasets.create_auto_snapshot_set() (/usr/share/time-slider/lib/time_slider/zfs.py) but I have yet to identify it. I also assume that this is some kind of corner case, since I haven't seen anybody else report it. Maybe due to having multiple pools or something. I'll try to gather more information.

#2

Updated by Jeremy Thornhill almost 9 years ago

I have identified a bug in how time-slider determines whether to do recursive snapshots. I guess my configuration is not common so I will explain what I'm doing, and why it breaks the code.

/usr/share/time-slider/lib/time_slider/zfs.py

The snapshotting code in create_auto_snapshot_set creates an empty list ('excluded') which is to hold "excluded" datasets. This list gets populated in two passes: on the first, it scrapes the output of 'zfs list -H -t filesystem,volume -o name,com.sun:auto-snapshot:<timeperiod> -s name' (where timeperiod is the name of the run, e.g. "frequent" or "hourly") and it appends anything that's "false" to the 'excluded' list; on the second, it scrapes 'zfs list -H -t filesystem,volume -o name,com.sun:auto-snapshot -s name' and appends the "false" datasets from that into the same list.
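To make the two-pass construction concrete, here is a minimal Python sketch. It does not call zfs: the two sample outputs are made up (tab-separated, the way 'zfs list -H' prints them), and the helper name false_datasets is hypothetical, not a function in zfs.py.

```python
# Made-up sample output for the per-schedule property
# (com.sun:auto-snapshot:frequent), sorted by name.
FREQUENT_PASS = (
    "external\tfalse\n"
    "rpool\t-\n"
    "rpool/dump\t-\n"
    "rpool/swap\t-\n"
    "vault\t-\n"
    "vault/backup\tfalse\n"
)

# Made-up sample output for the global property (com.sun:auto-snapshot),
# also sorted by name.
GLOBAL_PASS = (
    "external\ttrue\n"
    "rpool\ttrue\n"
    "rpool/dump\tfalse\n"
    "rpool/swap\tfalse\n"
    "vault\ttrue\n"
    "vault/backup\ttrue\n"
)


def false_datasets(output):
    """Return dataset names whose property value is the string 'false'."""
    names = []
    for line in output.splitlines():
        name, value = line.split("\t")
        if value == "false":
            names.append(name)
    return names


# Two passes appended to one list: each pass is sorted on its own,
# but the concatenation is not.
excluded = []
excluded.extend(false_datasets(FREQUENT_PASS))
excluded.extend(false_datasets(GLOBAL_PASS))

print(excluded)
print(excluded == sorted(excluded))
```

Each pass is individually ordered by name, but appending the second pass after the first leaves the combined list unsorted.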

This creates one mega list. Mine for 'frequent' looks like the following (edited for brevity):

['external',
 'external/backup',
 'external/games',
 'external/images',
 'external/music',
 'vault/backup',
 'rpool/ROOT/opensolaris-3',
 'rpool/ROOT/opensolaris-4',
 'rpool/ROOT/opensolaris-5',
 'rpool/dump',
 'rpool/swap',
 'vault/jumpstart',]

This list is correct in that all of these should be excluded from the run, but note that it's not ordered since it was created in two different calls to the zfs cli tool. The first run (checking in this case for com.sun:auto-snapshot:frequent=false) has populated the list with all of the datasets up to 'vault/backup'; from there on down, the list was populated from datasets with com.sun:auto-snapshot=false.

This in and of itself doesn't seem like it's necessarily a problem to me, but when the logic to determine whether it's ok to do zfs snapshot -r kicks in later on, it searches through this list in a way that assumes it will be ordered; that is:

for child in children:
    idx = bisect_left(excluded, child)
    if idx < len(excluded) and excluded[idx] == child:
        excludedchild = True
        single.append(datasetname)
        break

bisect_left requires a sorted list. Since this list isn't in order any more, the binary search can land on the wrong index and fail to match datasets that are in the list, particularly those from the second run (which looked for com.sun:auto-snapshot=false).
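The effect can be reproduced outside time-slider. The sketch below applies the same bisect_left test to the abbreviated 'excluded' list shown earlier and collects every entry that the lookup fails to find even though it is in the list. The names found_by_bisect and missed are mine, for illustration only.

```python
from bisect import bisect_left

# The abbreviated 'excluded' list from this ticket, in its original order.
excluded = [
    'external',
    'external/backup',
    'external/games',
    'external/images',
    'external/music',
    'vault/backup',
    'rpool/ROOT/opensolaris-3',
    'rpool/ROOT/opensolaris-4',
    'rpool/ROOT/opensolaris-5',
    'rpool/dump',
    'rpool/swap',
    'vault/jumpstart',
]


def found_by_bisect(candidates, name):
    """Same membership test as the snippet above; only valid on sorted input."""
    idx = bisect_left(candidates, name)
    return idx < len(candidates) and candidates[idx] == name


# Every name here IS in the list, so anything collected below is a false miss.
missed = [name for name in excluded if not found_by_bisect(excluded, name)]
print(missed)

# The same lookup against a sorted copy finds every entry.
ordered = sorted(excluded)
print(all(found_by_bisect(ordered, name) for name in excluded))
```

On this abbreviated list the lookup misses 'vault/backup' and 'rpool/ROOT/opensolaris-3', while the remaining entries happen to be found. Which entries are missed depends on the exact contents and length of the list, so the full, unedited list will likely behave differently.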

The reason I suppose that this is a corner case is that it only can crop up if you're both disabling certain time-slider runs on some datasets and disabling time-slider completely on other datasets. And even if you're doing that, it may never manifest as a problem if the parent filesystems of both of those are the same.

I can make a patch to modify this behavior but I'm not sure entirely what's appropriate. Should that list be sorted? Does it matter? If it does matter, is it really safe to trust the order that's generated by the zfs list command?
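For discussion, here are two possible shapes for a fix, as a sketch rather than a tested patch against zfs.py. Both function names are hypothetical. Sorting in Python, rather than relying on the order 'zfs list -s name' produces, guarantees that the order matches the comparison bisect_left itself uses; a set sidesteps the ordering question entirely.

```python
from bisect import bisect_left


def has_excluded_child_sorted(children, excluded):
    """Option 1: keep the bisect lookup, but sort (and de-duplicate) first."""
    ordered = sorted(set(excluded))
    for child in children:
        idx = bisect_left(ordered, child)
        if idx < len(ordered) and ordered[idx] == child:
            return True
    return False


def has_excluded_child_set(children, excluded):
    """Option 2: drop bisect and use set membership; order no longer matters."""
    return not set(excluded).isdisjoint(children)


# Illustrative data only: an unsorted 'excluded' list and some child datasets.
excluded = ['external', 'vault/backup', 'rpool/dump', 'rpool/swap']
children = ['rpool/ROOT', 'rpool/dump', 'rpool/export']

print(has_excluded_child_sorted(children, excluded))
print(has_excluded_child_set(children, excluded))
```

Either variant would tell the caller that a recursive snapshot of the parent is not safe because at least one child is excluded.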

#4

Updated by Julian Wiesener over 8 years ago

  • Category set to Desktop (JDS)
  • Assignee set to OI JDS
  • Difficulty set to Medium
  • Tags set to time-slider
#5

Updated by Guido Berhörster over 8 years ago

  • Target version set to oi_151
#6

Updated by Julian Wiesener over 8 years ago

  • Target version changed from oi_151 to oi_151_stable
#7

Updated by Ken Mays about 8 years ago

  • Due date set to 2011-09-14
  • Status changed from New to Closed
  • Estimated time set to 40.00 h
  • Difficulty changed from Medium to Expert

Closing ticket as possible duplicate. Time slider is maintained upstream, and we'll implement the latest fixes in a post-oi_151a JDS update. This may be spun off into a project independent of OI-JDS to better maintain and update time slider. See Bug #1013 as the tracked duplicate.
