Re: Tarsnap "de-duplication"
Hi!
Revisiting a mail I sent some months back; I have some more questions.
I would just like to know how this service ticks under the hood -- there is a lot of advanced tech here and I would love to learn how it works :-)
On 6 Oct 2010, at 15:18, Colin Percival wrote:
>> How does Tarsnap ensure that data is not duplicated when a new version of a file changes the
>> whole structure of the file, making it hard to "line up" the fingerprints of earlier backup blocks with the fingerprints of
>> the current file's blocks (i.e. a part of the file that was found in one block might now be pushed into two separate blocks, causing
>> a similar cascade in the rest of the file)?
>
> Tarsnap uses variable-length blocks in order to solve this problem. If you
> add or remove data in the middle of a file, the sequence of block boundaries
> will "resynchronize" quickly.
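(For readers following along: the resynchronization Colin describes can be illustrated with a toy content-defined chunking sketch. The window size, mask, and use of SHA-256 below are illustrative assumptions, not Tarsnap's actual parameters or rolling hash -- the point is only that cut points depend on nearby content alone, so they line up again shortly after an insertion.)

```python
import hashlib
import random

WINDOW = 48           # bytes of context that determine a cut point
MASK = (1 << 11) - 1  # cut when the low 11 bits of the window hash are zero:
                      # average block ~2 KiB (toy numbers, not Tarsnap's)

def chunk(data):
    """Split data into variable-length blocks at content-defined cut points."""
    blocks, start = [], 0
    for i in range(WINDOW, len(data)):
        # Toy stand-in for a rolling checksum: hash the last WINDOW bytes.
        h = int.from_bytes(hashlib.sha256(data[i - WINDOW:i]).digest()[:4], "big")
        if h & MASK == 0:
            blocks.append(data[start:i])
            start = i
    blocks.append(data[start:])
    return blocks

random.seed(0)
original = bytes(random.getrandbits(8) for _ in range(1 << 16))
edited = original[:30000] + b"INSERTED" + original[30000:]

a = chunk(original)
b = chunk(edited)
fp = lambda blks: {hashlib.sha256(x).digest() for x in blks}
shared = fp(a) & fp(b)
# Only the block(s) around the insertion differ; every later cut point falls
# at the same *content* position, so the remaining blocks line up again.
```

Because each cut depends only on the last WINDOW bytes, an insertion can only disturb the blocks that actually contain it; everything after resynchronizes without any searching.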
Since the block fingerprints (MD5, SHA-1, SHA-256?) are all Tarsnap knows about the underlying data,
and Tarsnap therefore has no way of finding previously stored data by searching each of the files being backed up for earlier patterns,
I assume that Tarsnap has to do the following when it reaches an area of a file that has not been backed up previously:
1) Move in one bit from the last backed-up block.
2) Grab one block of data (64 kB) starting at that point in the file.
3) Fingerprint that block, and check if the block is stored.
4) Rinse and repeat until you find common ground again.
Based on what Tarsnap knows about the data, I assume this is the only way to do it, but I see two problems:
A) I would think this would be relatively slow for big changes?
B) Variable-length blocks: since the blocks can vary in size, it would be difficult to know how much
you can move into a new file to create a new fingerprint (see 2 above).
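(For concreteness, the rescan loop in steps 1-4 might look like the sketch below. The `stored` fingerprint set, the byte-wise advance, and the scaled-down block size are assumptions made for the demo -- this is the scheme the mail describes, not necessarily what Tarsnap actually does.)

```python
import hashlib
import random

BLOCK = 4096  # probe size; the mail assumes 64 kB, scaled down for the demo

def resync(data, pos, stored):
    """Steps 1-4 above: from `pos`, fingerprint a BLOCK-sized window,
    advance one byte on a miss, and stop when a window matches a
    previously stored block.  Each probe rehashes BLOCK bytes, so a
    long mismatched region costs O(gap * BLOCK) -- problem A above."""
    while pos + BLOCK <= len(data):
        if hashlib.sha256(data[pos:pos + BLOCK]).digest() in stored:
            return pos
        pos += 1
    return len(data)

random.seed(0)
original = bytes(random.getrandbits(8) for _ in range(3 * BLOCK))
stored = {hashlib.sha256(original[i:i + BLOCK]).digest()
          for i in range(0, len(original), BLOCK)}

# 8 bytes inserted inside the second block: the second block's fingerprint
# is destroyed, and the third block's content now starts at 2*BLOCK + 8,
# which is where the byte-at-a-time scan eventually finds it.
edited = original[:5000] + b"XXXXXXXX" + original[5000:]
```

This also makes problem A concrete: finding that one match costs thousands of full-block hashes, which is exactly the work content-defined boundaries avoid.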
I would be happy to hear something about the issues described above! :-)
Thanks!
Richard Taubo