
Tarsnap "de-duplication"



Hi!

I have read that the tarsnap client aims to produce block sizes of 64 kB (roughly 30 kB after compression),
see: http://news.ycombinator.com/item?id=397392
I assume, then, that ~30 kB is the size of the individual file blocks, encrypted and compressed, that are stored at Amazon S3.

I know that the server/S3 does not know anything about the contents of the files, and that each
file block is stored both encrypted and compressed. I therefore assume that, when deciding whether to
back up a given file block, Tarsnap checks the fingerprint of every block already on the server/S3
against the fingerprint of the corresponding block on the client, using something similar to MD5 for fingerprinting.
I am assuming the fingerprint is taken over the unencrypted, uncompressed block, but I am just guessing :-).
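To make my assumption concrete, here is a toy sketch of the kind of fingerprint-based dedup I am imagining — fixed 64 kB blocks, with SHA-256 standing in for the "MD5-like" fingerprint. The function names and the protocol are my own guesses for illustration, not anything from the actual Tarsnap client:

```python
import hashlib
import os

BLOCK = 64 * 1024  # 64 kB blocks, per the HN post

def chunk_fixed(data, size=BLOCK):
    """Split the file data into fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def blocks_to_upload(data, known_fingerprints):
    """Hypothetical dedup step: return only the blocks whose fingerprint
    has not been seen before, recording the new fingerprints as we go."""
    new_blocks = []
    for block in chunk_fixed(data):
        fp = hashlib.sha256(block).hexdigest()  # stand-in for the fingerprint
        if fp not in known_fingerprints:
            known_fingerprints.add(fp)
            new_blocks.append(block)
    return new_blocks

known = set()
data = os.urandom(4 * BLOCK)
print(len(blocks_to_upload(data, known)))  # first backup: all 4 blocks are new
print(len(blocks_to_upload(data, known)))  # unchanged file: 0 blocks to upload
```

With an unchanged file, every fingerprint matches and nothing needs to be uploaded — which is the behaviour I assume Tarsnap is after.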

Based on all this, which might be completely wrong :-), I wonder:
how does Tarsnap ensure that data is not duplicated when a new version of a file changes the
whole structure of the file, making it hard to "line up" the fingerprints of earlier backup blocks with the fingerprints of
the current blocks? (E.g. a part of the file that used to fall within one block might now be split across two
blocks, shifting every boundary after it and causing a similar cascade through the rest of the file.)
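Here is a toy illustration of the cascade I mean, again assuming fixed 64 kB blocks and SHA-256 as the fingerprint (my assumption, not Tarsnap's actual scheme): inserting a single byte at the front of a file shifts every block boundary, so none of the old fingerprints line up any more.

```python
import hashlib
import os

BLOCK = 64 * 1024

def fingerprints(data):
    # Fingerprint each fixed-size 64 kB block of the file.
    return [hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

original = os.urandom(4 * BLOCK)
edited = b"X" + original  # a single byte inserted at the front

old_fps = fingerprints(original)
new_fps = fingerprints(edited)
# Every block boundary has shifted by one byte, so no fingerprint matches:
print(len(set(old_fps) & set(new_fps)))  # 0 shared blocks
```

So with naive fixed-size blocking, a one-byte edit would force the whole file to be uploaded again — which is why I am curious how Tarsnap handles this.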


Thanks for input!


Richard Taubo