Re: Identifying which files changed between archives

Hi Scott,

Of course!  The current draft has it called --probability-check-file.

The man-page entry (again, in the draft) is:

--probability-check-file p
(c mode only)
If a file has not changed, read it anyway (with the given probability p from
0.0 to 1.0) and compare its hash(es) and trailer to the cached values.  This
might provide a warning about a failing disk if other detection methods fail.
However, this operation will make Tarsnap run considerably slower, so we do
not recommend using a high probability with large archives.
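
To illustrate the idea (this is only a rough sketch, not the code in the
draft): the names spot_check and cached_hashes are made up for the example,
and it compares one SHA-256 hash per file rather than Tarsnap's chunk hashes
and trailer.

```python
import hashlib
import random


def sha256_of(path):
    """Hash a file's contents, reading it in 64 KiB pieces."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def spot_check(unchanged, cached_hashes, p, rng=random):
    """Re-read each unchanged file with probability p.

    unchanged: paths the filesystem reports as unmodified.
    cached_hashes: hypothetical dict of path -> hash from the previous backup.
    Returns the paths whose current contents no longer match the cache.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0.0 and 1.0")
    suspect = []
    for path in unchanged:
        # random() is in [0.0, 1.0), so p=0.0 never checks and p=1.0 always does.
        if rng.random() < p and sha256_of(path) != cached_hashes[path]:
            suspect.append(path)
    return suspect
```

With p = 0.1, roughly one unchanged file in ten gets re-read on each run, so
the extra disk I/O scales with p.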

- Graham

On Tue, Jun 25, 2019 at 07:55:04AM +1000, Scott Dickinson wrote:
>    Hi Graham,
>    That definitely sounds useful. Is the proposed probability value
>    intended to be a user-configurable value? I'm thinking of keeping it
>    low for normal backups as you suggested, then ramping up to a higher
>    value once every x backups. Similar idea to running deltas most of the
>    time, then a full one every period.
>    Cheers,
>    Scott
>    On 25 June 2019 7:09:54 am AEST, Graham Percival <gperciva@tarsnap.com>
>    wrote:
> Hi Scott,
> Thanks for clarifying the use case!  Colin had the idea of using the Tarsnap
> cache to detect disk errors.  Namely, if the filesystem reports that a file
> hasn't changed, there would be a random chance that the tarsnap client would
> read the file anyway, and compare the chunk hashes against the expected values
> (from previous backups).  If you wanted to be paranoid, you could specify a
> probability of 100%, but more likely you'd pick a value like 10% so that it
> didn't impact performance too much.
> This wouldn't warn you about a disk failure which changed the file
> modification time or size, but it would be perfect for a disk which flipped a
> few bits in a file.
> https://github.com/Tarsnap/tarsnap/issues/19
> The good news is that I have a proof-of-concept implementation of this.
> I ended up putting it on the back-burner, but I've been looking at the code
> this morning and I still think it's plausible.  Does this sound useful?
> Cheers,
> - Graham
> On Sun, Jun 23, 2019 at 06:16:52PM +1000, Scott Dickinson wrote:
>      Thanks Colin & Jacob.
>      With several hundred GB of data being archived, the local tarball
>      option is probably not going to work for me.
>      Does "tarsnap -t -f" show file modification date based on what the
>      filesystem is reporting, or on when tarsnap detects a change?
>      To provide more details, I had a number of sectors on an SSD silently
>      fail, so I needed to identify and restore files that were corrupted by
>      this event. The filesystem did not report any change in modification
>      date on these files, so I couldn't rely on this to identify which files
>      to restore. Hence my question around reporting on the files impacted by
>      block changes between archives, to both identify an unexpected change
>      and recover from this.
>      If tarsnap can't do this, perhaps I need to start capturing a hash of
>      each file at the time of backup, and compare those between archives.
>      Cheers,
>      Scott
>      On 19/6/19 7:15 am, Jacob Larsen wrote:
>      I had the same issue a while back. I was told it was not easily
>      fixed due to the layers in Tarsnap. I ended up making a regular
>      tarball and fed that to tarsnap. That way I had a local tarball that
>      matched the actual data in the archive. Then I could extract it and
>      compare at the next backup. A somewhat data-heavy process, but it gave me
>      what I needed. It is scriptable, so it is possible to let your
>      backup script log the changed files on each backup run, but it has a
>      pretty high cost in disk I/O, plus you need to keep a copy of your
>      data around between backups.
>      /Jacob
>      On 18/06/2019 13.49, Scott Dickinson wrote:
>      Hi,
>      I'm trying to work out how to generate a report on files that are
>      new or changed in a particular archive. I can't seem to find an easy
>      way to do this, so hoping someone can help.
>      Here is the scenario I'm working through.
>      1. Backup directory "x" on 1st May 2019. First time archive, all
>      10 GB are sent as expected.
>      2. Backup directory "x" on 1st June 2019. Second time archive, 25 MB
>      are sent.
>      How can I report on which files that 25 MB of deltas are part of? In
>      this scenario, I wasn't expecting any changes to the files over the
>      month, so am surprised there was anything above the metadata to be
>      backed up. My understanding is that Tarsnap needs to know which
>      files the changed blocks belong to, therefore in theory this
>      metadata should be extractable.
>      The closest I've found to locate this is "tarsnap -t -f 'x' -v
>      --iso-dates", but this doesn't natively provide the details I'm
>      after. Ideally I'd like tarsnap to be able to report which files
>      were uploaded at the time of archiving with an option similar to
>      --print-stats.
>      Anyone got any ideas?
>      Cheers,
>      Scott
>    --
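
For the workaround Scott mentions in the quoted thread (capturing a hash of
each file at backup time and comparing between archives), something along
these lines might do as a starting point.  It is a sketch independent of
Tarsnap; build_manifest and diff_manifests are names invented for the
example, and it assumes you save each manifest yourself (e.g. as JSON)
alongside the corresponding backup.

```python
import hashlib
import os


def build_manifest(root):
    """Map each regular file under root (by relative path) to its SHA-256."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.isfile(full):
                continue  # skip broken symlinks and other non-regular entries
            digest = hashlib.sha256()
            with open(full, "rb") as f:
                for block in iter(lambda: f.read(65536), b""):
                    digest.update(block)
            manifest[os.path.relpath(full, root)] = digest.hexdigest()
    return manifest


def diff_manifests(old, new):
    """Return (added, removed, changed) as sorted lists of relative paths."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(p for p in set(old) & set(new) if old[p] != new[p])
    return added, removed, changed
```

A file that shows up in "changed" while its modification time is the same as
before would be a candidate for the kind of silent corruption described
above.  Note that this reads every file on every run, so it has the same
disk I/O cost as a probability of 1.0.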
