
Re: Tarsnap "de-duplication"



On 10/05/10 08:25, Richard wrote:
> I have read that the tarsnap client aims at producing block sizes of 64kB (compressed to 30 kB), 

The average block size is about 64 kB, and the average block size after
compression is about 30 kB.

> I assume then that 30kB is the size of the individual file blocks, encrypted and compressed, stored at Amazon S3.

The blocks aren't stored directly to S3 -- they're stored in a synthetic
S3-backed log-structured filesystem, which allows multiple blocks to be
stored in a single S3 PUT -- but that's just an implementation detail.

> I know that the Server/S3 does not know anything about the content of the files, and that the content of each
> file block is both encrypted and compressed. I therefore assume that Tarsnap is checking the
> fingerprint of every file block on Server/S3 against the corresponding file block fingerprint on the client,
> using something similar to MD5 for fingerprinting, when deciding if it will back up a given file block or not.

Tarsnap keeps a cache locally telling it which blocks it has previously
uploaded; querying the server for every block would be much too slow (and
reveal more information to the server about how archives are changing over
time).

> I am assuming that the fingerprint is of the unencrypted and uncompressed file block, but I am just guessing :-).

Correct.
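
As a rough illustration of the two answers above (a local cache of
already-uploaded blocks, and fingerprints taken over the plaintext block),
here is a minimal Python sketch. It is not Tarsnap's code: the BlockCache
class, the choice of SHA-256 as the fingerprint, and the use of zlib are
all assumptions made for the example, and the encryption step is left out
entirely.

```python
import hashlib
import zlib


class BlockCache:
    """Hypothetical local record of which blocks were already uploaded."""

    def __init__(self):
        self.seen = set()

    def fingerprint(self, block):
        # The fingerprint is computed over the raw block, before any
        # compression (or encryption). SHA-256 is a stand-in here.
        return hashlib.sha256(block).digest()

    def blocks_to_upload(self, blocks):
        """Return (fingerprint, compressed block) pairs for new blocks only."""
        new = []
        for block in blocks:
            fp = self.fingerprint(block)
            if fp in self.seen:
                continue  # already uploaded: no server round trip needed
            self.seen.add(fp)
            # A real client would also encrypt here; omitted in this sketch.
            new.append((fp, zlib.compress(block)))
        return new


cache = BlockCache()
first = cache.blocks_to_upload([b"a" * 100, b"b" * 100, b"a" * 100])
second = cache.blocks_to_upload([b"a" * 100, b"c" * 100])
print(len(first), len(second))  # 2 1
```

The point of the design is that the decision to skip a block is made
entirely on the client, from the cache, without asking the server.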

> How does Tarsnap ensure that the data is not duplicated when a new version of a file changes the
> whole structure of the file, making it hard to "line up" the fingerprints of earlier backup blocks with the fingerprints of
> the current file blocks (i.e., a part of the file that was found in one file block might now be pushed into two separate blocks, causing
> a similar cascade in the rest of the file)?

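Tarsnap uses variable-length blocks in order to solve this problem: block
boundaries are chosen based on the data itself rather than at fixed offsets.
If you add or remove data in the middle of a file, the sequence of block
boundaries will "resynchronize" quickly, so only the blocks near the change
need to be uploaded again.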
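
To see why variable-length blocks resynchronize, here is a toy Python
sketch of content-defined chunking. It is not Tarsnap's algorithm: the
16-byte window, the CRC32 test, and the 64-byte average chunk size (rather
than 64 kB) are assumptions chosen to keep the example small. A boundary is
placed wherever a hash of the preceding few bytes matches a pattern, so
whether a boundary falls at a given spot depends only on nearby content,
not on the offset in the file.

```python
import random
import zlib

WINDOW = 16   # bytes of context that decide whether a boundary falls here
MASK = 0x3F   # boundary when the low 6 bits are zero (~64-byte average)


def chunk(data):
    """Split data at content-defined boundaries (toy, non-rolling hash)."""
    chunks, start = [], 0
    for end in range(1, len(data) + 1):
        window = data[max(0, end - WINDOW):end]
        # A real implementation would update a rolling hash incrementally
        # instead of rehashing the whole window at every position.
        if (zlib.crc32(window) & MASK) == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def fixed_chunks(data, size=64):
    """Split data into fixed-size blocks, for comparison."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def shared(old, new):
    """Number of distinct blocks that appear in both versions."""
    return len(set(old) & set(new))


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20000))
modified = original[:100] + b"HELLO" + original[100:]  # insert near the start

cdc_old, cdc_new = chunk(original), chunk(modified)
fix_old, fix_new = fixed_chunks(original), fixed_chunks(modified)

print("content-defined:", shared(cdc_old, cdc_new), "of", len(set(cdc_old)))
print("fixed-size:     ", shared(fix_old, fix_new), "of", len(set(fix_old)))
```

With fixed-size blocks, the five inserted bytes shift every later block, so
almost nothing matches the earlier backup. With content-defined boundaries,
only the blocks around the insertion point change, and everything after
them lines up again.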

-- 
Colin Percival
Security Officer, FreeBSD | freebsd.org | The power to serve
Founder / author, Tarsnap | tarsnap.com | Online backups for the truly paranoid