
Re: Identifying which files changed between archives



Hi Scott,

Thanks for clarifying the use case!  Colin had the idea of using the Tarsnap
cache to detect disk errors.  Namely, if the filesystem reports that a file
hasn't changed, there would be a random chance that the tarsnap client would
read the file anyway, and compare the chunk hashes against the expected values
(from previous backups).  If you wanted to be paranoid, you could specify a
probability of 100%, but more likely you'd pick a value like 10% so that it
didn't impact performance too much.

This wouldn't warn you about a disk failure which changed the file
modification time or size, but it would be perfect for a disk which flipped a
few bits in a file.
https://github.com/Tarsnap/tarsnap/issues/19
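
To make the idea concrete, here is a rough sketch in Python.  It is not the
proof-of-concept code and not how the tarsnap client is structured: the
function names, the layout of the "cache" dict, and the use of a whole-file
SHA-256 in place of per-chunk hashes are all assumptions made for
illustration.

```python
import hashlib
import os
import random


def file_digest(path, bufsize=1 << 20):
    """Return the SHA-256 hex digest of a file, read in blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(bufsize), b""):
            h.update(block)
    return h.hexdigest()


def spot_check(paths, cache, probability=0.10, rng=random):
    """Re-read a random sample of files that look unchanged.

    `cache` is a hypothetical dict mapping path -> (size, mtime_ns, digest)
    recorded at the previous backup.  Returns the paths whose contents no
    longer match the recorded digest even though size and mtime still do.
    """
    suspects = []
    for path in paths:
        entry = cache.get(path)
        if entry is None:
            continue  # new file: the normal backup path reads it anyway
        size, mtime_ns, digest = entry
        st = os.stat(path)
        if (st.st_size, st.st_mtime_ns) != (size, mtime_ns):
            continue  # metadata changed: the file is re-read regardless
        if rng.random() >= probability:
            continue  # not selected for verification on this run
        if file_digest(path) != digest:
            suspects.append(path)  # same metadata, different contents
    return suspects
```

With probability=1.0 every apparently unchanged file is re-read; with 0.10
roughly one file in ten is checked per run, so a corrupted file is likely to
be noticed after a number of backups rather than immediately.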

The good news is that I have a proof-of-concept implementation of this.
I ended up putting it on the back-burner, but I've been looking at the code
this morning and I still think it's plausible.  Does this sound useful?

Cheers,
- Graham

On Sun, Jun 23, 2019 at 06:16:52PM +1000, Scott Dickinson wrote:
>    Thanks Colin & Jacob.
>    With several hundred GB of data being archived, the local tarball
>    option is probably not going to work for me.
>    Does "tarsnap -t -f" show file modification date based on what the
>    filesystem is reporting, or when tarsnap detects a change?
>    To provide more details, I had a number of sectors on an SSD silently
>    fail so I needed to identify and restore files that were corrupted by
>    this event. The filesystem did not report any change in modification
>    date on these files, so I couldn't rely on this to identify which files
>    to restore. Hence my question around reporting on the files impacted by
>    block changes between archives, to both identify an unexpected change,
>    and recover from this.
>    If tarsnap can't do this, perhaps I need to start capturing a hash of
>    each file at the time of backup, and compare those between archives.
>    Cheers,
>    Scott
> 
>    On 19/6/19 7:15 am, Jacob Larsen wrote:
> 
>      I had the same issue a while back. I was told it was not easily
>      fixed due to the layers in Tarsnap. I ended up making a regular
>      tarball and fed that to tarsnap. That way I had a local tarball that
>      matched the actual data in the archive. Then I could extract it and
>      compare at the next backup. A somewhat data-heavy process but it gave me
>      what I needed. It is scriptable, so it is possible to let your
>      backup script log the changed files on each backup run, but it has a
>      pretty high cost in disk I/O, plus you need to keep a copy of your
>      data around between backups.
>      /Jacob
>      On 18/06/2019 13.49, Scott Dickinson wrote:
> 
>      Hi,
>      I'm trying to work out how to generate a report on files that are
>      new or changed in a particular archive. I can't seem to find an easy
>      way to do this, so hoping someone can help.
>      Here is the scenario I'm working through.
>      1. Backup directory "x" on 1st May 2019. First time archive, all
>      10 GB are sent as expected.
>      2. Backup directory "x" on 1st June 2019. Second time archive, 25 MB
>      are sent.
>      How can I report on which files that 25 MB of deltas is part of? In
>      this scenario, I wasn't expecting any changes to the files over the
>      month, so am surprised there was anything above the metadata to be
>      backed up. My understanding is that Tarsnap needs to know which
>      files the changed blocks belong to, therefore in theory this
>      metadata should be extractable.
>      The closest I've found to locate this is "tarsnap -t -f 'x' -v
>      --iso-dates", but this doesn't natively provide the details I'm
>      after. Ideally I'd like tarsnap to be able to report which files
>      were uploaded at the time of archiving with an option similar to
>      --print-stats.
>      Anyone got any ideas?
>      Cheers,
>      Scott