Re: Identifying which files changed between archives

On 25 June 2019 7:09:54 am AEST, Graham Percival <gperciva@tarsnap.com> wrote:
Hi Scott,

Thanks for clarifying the use case!  Colin had the idea of using the Tarsnap
cache to detect disk errors.  Namely, if the filesystem reports that a file
hasn't changed, there would be a random chance that the tarsnap client would
read the file anyway, and compare the chunk hashes against the expected values
(from previous backups).  If you wanted to be paranoid, you could specify a
probability of 100%, but more likely you'd pick a value like 10% so that it
didn't impact performance too much.

This wouldn't warn you about a disk failure which changed the file
modification time or size, but it would be perfect for a disk which flipped a
few bits in a file.
https://github.com/Tarsnap/tarsnap/issues/19
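
A minimal sketch of the spot-check idea described above. This is not
Tarsnap's implementation: it assumes a hypothetical cache that maps each path
to the size, modification time, and a whole-file SHA-256 digest recorded at
the previous backup, whereas Tarsnap itself works with chunk hashes. The
names file_digest, spot_check, and cache are illustrative only.

```python
import hashlib
import os
import random


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def spot_check(cache, probability=0.1, rng=random.random):
    """Re-read a random sample of files that the filesystem says are unchanged.

    cache is a hypothetical structure: path -> (size, mtime_ns, sha256 hex).
    Returns the paths whose contents no longer match the recorded digest.
    """
    suspects = []
    for path, (size, mtime_ns, digest) in cache.items():
        st = os.stat(path)
        if (st.st_size, st.st_mtime_ns) != (size, mtime_ns):
            # The filesystem reports a change, so a normal backup run
            # would re-read this file anyway; nothing to spot-check.
            continue
        if rng() >= probability:
            # Not selected this run; probability=1.0 checks every file.
            continue
        if file_digest(path) != digest:
            suspects.append(path)
    return suspects
```

With probability=1.0 every unchanged-looking file is re-read (the paranoid
setting); a value such as 0.1 trades detection latency for less disk I/O.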

The good news is that I have a proof-of-concept implementation of this.
I ended up putting it on the back-burner, but I've been looking at the code
this morning and I still think it's plausible.  Does this sound useful?

Cheers,
- Graham

On Sun, Jun 23, 2019 at 06:16:52PM +1000, Scott Dickinson wrote:
   Thanks Colin & Jacob.
   With several hundred GB of data being archived, the local tarball
   option is probably not going to work for me.
   Does "tarsnap -t -f" show the file modification date based on what the
   filesystem is reporting, or on when tarsnap detects a change?
   To provide more details, I had a number of sectors on an SSD silently
   fail, so I needed to identify and restore files that were corrupted by
   this event. The filesystem did not report any change in modification
   date on these files, so I couldn't rely on this to identify which files
   to restore. Hence my question around reporting on the files impacted by
   block changes between archives, to both identify an unexpected change
   and recover from it.
   If tarsnap can't do this, perhaps I need to start capturing a hash of
   each file at the time of backup, and compare those between archives.
   Cheers,
   Scott
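
The per-file hash approach Scott suggests could be sketched as follows. This
is an illustration under stated assumptions, not a Tarsnap feature: the
manifest layout and the names sha256_of, build_manifest, and
silently_changed are hypothetical. It flags files whose contents differ
between two backup runs even though size and modification time are
identical, which is the signature of the silent corruption described above.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def build_manifest(root):
    """Map each regular file under root (by relative path) to its metadata."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip broken symlinks and special files
            st = os.stat(path)
            manifest[os.path.relpath(path, root)] = {
                "size": st.st_size,
                "mtime_ns": st.st_mtime_ns,
                "sha256": sha256_of(path),
            }
    return manifest


def silently_changed(old, new):
    """Paths whose digest changed while size and mtime stayed the same."""
    suspects = []
    for path, entry in new.items():
        prev = old.get(path)
        if prev is None:
            continue  # new file, not a silent change
        same_metadata = (prev["size"], prev["mtime_ns"]) == (
            entry["size"],
            entry["mtime_ns"],
        )
        if same_metadata and prev["sha256"] != entry["sha256"]:
            suspects.append(path)
    return sorted(suspects)
```

Each manifest is a plain dictionary, so it could be serialized (for example
as JSON) and stored inside the backed-up directory at each run, then compared
against the manifest from the previous run.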

   On 19/6/19 7:15 am, Jacob Larsen wrote:

     I had the same issue a while back. I was told it was not easily
     fixed due to the layers in Tarsnap. I ended up making a regular
     tarball and fed that to tarsnap. That way I had a local tarball that
     matched the actual data in the archive. Then I could extract it and
     compare at the next backup. It is a fairly data-heavy process, but it
     gave me what I needed. It is scriptable, so it is possible to let your
     backup script log the changed files on each backup run, but it has a
     pretty high cost in disk I/O, plus you need to keep a copy of your
     data around between backups.
     /Jacob
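
A rough sketch of how the tarball comparison Jacob describes might be
scripted. It is a variation on his approach rather than his actual script:
instead of extracting the previous tarball to disk, it hashes the members of
both tarballs in place using Python's standard tarfile module. The function
names tar_digests and diff_tarballs are hypothetical.

```python
import hashlib
import tarfile


def tar_digests(tar_path):
    """Map each regular-file member of a tarball to its SHA-256 hex digest."""
    digests = {}
    with tarfile.open(tar_path, "r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # directories, links, and devices carry no data
            f = tar.extractfile(member)
            h = hashlib.sha256()
            while True:
                block = f.read(1 << 20)
                if not block:
                    break
                h.update(block)
            digests[member.name] = h.hexdigest()
    return digests


def diff_tarballs(old_tar, new_tar):
    """Report member names added, removed, or changed between two tarballs."""
    old = tar_digests(old_tar)
    new = tar_digests(new_tar)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(
            name for name in old.keys() & new.keys() if old[name] != new[name]
        ),
    }
```

A backup script could call diff_tarballs on the previous and current
tarballs and log the result on each run. The disk I/O cost noted above still
applies, since both tarballs are read in full.
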
     On 18/06/2019 13.49, Scott Dickinson wrote:

     Hi,
     I'm trying to work out how to generate a report on files that are
     new or changed in a particular archive. I can't seem to find an easy
     way to do this, so hoping someone can help.
     Here is the scenario I'm working through.
     1. Back up directory "x" on 1st May 2019. First-time archive, all
     10 GB are sent as expected.
     2. Back up directory "x" on 1st June 2019. Second-time archive, 25 MB
     are sent.
     How can I report on which files those 25 MB of deltas are part of? In
     this scenario, I wasn't expecting any changes to the files over the
     month, so am surprised there was anything beyond the metadata to be
     backed up. My understanding is that Tarsnap needs to know which
     files the changed blocks belong to, therefore in theory this
     metadata should be extractable.
     The closest I've found is "tarsnap -t -f 'x' -v
     --iso-dates", but this doesn't natively provide the details I'm
     after. Ideally I'd like tarsnap to be able to report which files
     were uploaded at the time of archiving, with an option similar to
     --print-stats.
     Anyone got any ideas?
     Cheers,
     Scott