[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Tarsnap "de-duplication"
On 03/10/11 04:43, Richard wrote:
> On 6. okt. 2010, at 15.18, Colin Percival wrote:
>> Tarsnap uses variable-length blocks in order to solve this problem. If you
>> add or remove data in the middle of a file, the sequence of block boundaries
>> will "resynchronize" quickly.
>
> Since the fingerprints (md5, sha1, sha256 ?) of the blocks are all Tarsnap knows about the underlying data,
> and Tarsnap therefore have no way of finding previous stored data by searching each of the files being backed up trying to find earlier patterns,
> I assume that Tarnsnap has to do the following when reaching an area of the file that has not been backed up previously:
> 1) Move in one bit from last backed up block.
> 2) Grab one block of data (64 kB) from that point on into the file.
> 3) Fingerprint that block, and check if the block is stored.
> 4) Rinse and repeat until you find common ground again.
Nope. Tarsnap's variable-length blocks are generated in a data-dependent way;
the "chunkification" code (tar/multitape/chunkify.c) adds data to a block one
byte at a time, and after each block "tastes" the data to decide if it has
reached a good place to end the current block. (Technically, it's looking for
regions where the partial sums of a power series repeat a value.)
Tarsnap doesn't consult its list of already-existing blocks until after it has
generated each possibly-new block.
--
Colin Percival
Security Officer, FreeBSD | freebsd.org | The power to serve
Founder / author, Tarsnap | tarsnap.com | Online backups for the truly paranoid