
Tarsnap "de-duplication"

To: tarsnap-users@tarsnap.com
Subject: Tarsnap "de-duplication"
From: Richard <ort@bergersen.no>
Date: Tue, 5 Oct 2010 17:25:48 +0200
Mailing-list: contact tarsnap-users-help@tarsnap.com; run by ezmlm

Hi!

I have read that the tarsnap client aims to produce blocks of 64 kB (compressed to 30 kB);
see: http://news.ycombinator.com/item?id=397392
I assume, then, that 30 kB is the size of each individual file block, compressed and encrypted, as stored on Amazon S3.

I know that the server/S3 does not know anything about the content of the files, and that each
file block is both compressed and encrypted. I therefore assume that Tarsnap compares the
fingerprint of every file block on the server/S3 with the fingerprint of the corresponding file block on the client,
using something similar to MD5 for fingerprinting, when deciding whether to back up a given file block.
I am assuming that the fingerprint is of the unencrypted and uncompressed file block, but I am just guessing :-).
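
To make my guess concrete, here is a minimal sketch of the scheme I have in mind. It is only an illustration of my assumption, not Tarsnap's actual code: the names fingerprint and backup_blocks are hypothetical, SHA-256 stands in for "something similar to MD5", and the encryption step is left out entirely.

```python
import hashlib
import zlib


def fingerprint(block: bytes) -> str:
    # Assumption: the fingerprint is taken over the plaintext,
    # uncompressed block. SHA-256 is a stand-in hash.
    return hashlib.sha256(block).hexdigest()


def backup_blocks(blocks, known_fingerprints):
    """Return (fingerprint, compressed block) pairs that would be uploaded.

    known_fingerprints is a set of fingerprints for blocks that are
    assumed to be stored already; it is updated in place.
    """
    uploaded = []
    for block in blocks:
        fp = fingerprint(block)
        if fp in known_fingerprints:
            continue  # block already stored, nothing to upload
        known_fingerprints.add(fp)
        # A real client would also encrypt here; omitted in this sketch.
        uploaded.append((fp, zlib.compress(block)))
    return uploaded


known = set()
first = backup_blocks([b"alpha" * 100, b"beta" * 100], known)
second = backup_blocks([b"alpha" * 100, b"gamma" * 100], known)
print(len(first), len(second))
```

In this sketch the second backup uploads only the one block whose fingerprint has not been seen before.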

Based on all this, which might be completely wrong :-), I then wonder:
How does Tarsnap ensure that data is not duplicated when a new version of a file changes the
whole structure of the file, making it hard to "line up" the fingerprints of earlier backup blocks with the fingerprints of
the current file blocks (i.e. a part of the file that was in one file block might now be pushed into two separate blocks, causing
a similar cascade through the rest of the file)?
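
The cascade I am worried about can be reproduced with a small sketch. It assumes blocks are cut at fixed 64 kB offsets, which is my guess and may well not be how Tarsnap splits files; split_fixed and fingerprints are hypothetical helper names, and SHA-256 again stands in for the fingerprint function.

```python
import hashlib
import random

BLOCK_SIZE = 64 * 1024  # assumed fixed block size


def split_fixed(data: bytes, size: int = BLOCK_SIZE):
    # Cut the data at fixed offsets: 0, size, 2*size, ...
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprints(data: bytes):
    # Set of fingerprints, one per fixed-size block.
    return {hashlib.sha256(b).hexdigest() for b in split_fixed(data)}


rng = random.Random(0)  # fixed seed so the run is repeatable
old_version = bytes(rng.getrandbits(8) for _ in range(4 * BLOCK_SIZE))

# One byte inserted at the front shifts every later block boundary.
inserted_version = b"X" + old_version
# One byte appended at the end leaves the earlier blocks untouched.
appended_version = old_version + b"X"

old_fps = fingerprints(old_version)
shared_after_insert = len(old_fps & fingerprints(inserted_version))
shared_after_append = len(old_fps & fingerprints(appended_version))
print(shared_after_insert, shared_after_append)
```

With fixed offsets, the one-byte insert at the front leaves no block fingerprint in common with the old version, while the append keeps all four of the original blocks. That is the situation I am asking about.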


Thanks for input!


Richard Taubo

Follow-Ups:
- Re: Tarsnap "de-duplication"
  - From: Colin Percival <cperciva@tarsnap.com>
