
Re: Tarsnap "de-duplication"



On 03/10/11 04:43, Richard wrote:
> On 6 Oct 2010, at 15:18, Colin Percival wrote:
>> Tarsnap uses variable-length blocks in order to solve this problem.  If you
>> add or remove data in the middle of a file, the sequence of block boundaries
>> will "resynchronize" quickly.
> 
> Since the fingerprints (md5, sha1, sha256 ?) of the blocks are all Tarsnap knows about the underlying data,
> and Tarsnap therefore has no way of finding previously stored data by searching each of the files being backed up trying to find earlier patterns,
> I assume that Tarsnap has to do the following when reaching an area of the file that has not been backed up previously:
> 	1) Move in one bit from last backed up block.
> 	2) Grab one block of data (64 kB) from that point on into the file.
> 	3) Fingerprint that block, and check if the block is stored.
> 	4) Rinse and repeat until you find common ground again.

Nope.  Tarsnap's variable-length blocks are generated in a data-dependent way;
the "chunkification" code (tar/multitape/chunkify.c) adds data to a block one
byte at a time, and after each byte "tastes" the data to decide if it has
reached a good place to end the current block.  (Technically, it's looking for
regions where the partial sums of a power series repeat a value.)
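
To illustrate the general idea of data-dependent boundaries, here is a rough
sketch in Python.  It is NOT the algorithm in chunkify.c: it swaps in a plain
polynomial rolling hash over a small window and ends a block whenever the low
bits of that hash are zero.  The names (chunkify, WINDOW, MASK, BASE, MOD) and
all the constants are made up for this example.

```python
from typing import List

# All constants below are hypothetical, chosen only to keep the demo small.
WINDOW = 16          # number of trailing bytes the rolling hash covers
MASK = (1 << 8) - 1  # end a block when the low 8 bits of the hash are zero
BASE = 257
MOD = (1 << 31) - 1


def chunkify(data: bytes) -> List[bytes]:
    """Split data into variable-length blocks at data-dependent boundaries.

    Simplified stand-in for illustration; not Tarsnap's actual method.
    """
    chunks = []
    start = 0                         # index where the current block began
    h = 0                             # rolling hash of the last WINDOW bytes
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, b in enumerate(data):
        if i - start >= WINDOW:
            # Drop the byte that just slid out of the window.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + b) % MOD
        # "Taste" the data: only the last WINDOW bytes decide the boundary.
        if i - start + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])   # whatever is left forms the last block
    return chunks
```

Because a boundary depends only on the bytes just before it, inserting data in
the middle of a file changes the blocks around the insertion, but the blocks
further along come out byte-for-byte identical to the ones produced before.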

Tarsnap doesn't consult its list of already-existing blocks until after it has
generated each possibly-new block.
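
In other words, the lookup happens per finished block, never per byte offset.
A minimal sketch of that ordering follows; the dict standing in for the list
of existing blocks, the store_chunks name, and the use of plain SHA-256 as the
fingerprint are all assumptions made for this example, not a description of
Tarsnap's internals.

```python
import hashlib
from typing import Dict, Iterable


def store_chunks(chunks: Iterable[bytes], store: Dict[str, bytes]) -> int:
    """Fingerprint each finished block and keep only the ones not yet stored.

    Returns the number of blocks that were actually new.
    """
    uploaded = 0
    for chunk in chunks:
        # The block is already complete here; only now is the index consulted.
        key = hashlib.sha256(chunk).hexdigest()  # placeholder fingerprint
        if key not in store:
            store[key] = chunk
            uploaded += 1
    return uploaded
```

With this ordering, a backup costs one fingerprint and one lookup per block,
rather than one per candidate offset as in the procedure guessed at above.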

-- 
Colin Percival
Security Officer, FreeBSD | freebsd.org | The power to serve
Founder / author, Tarsnap | tarsnap.com | Online backups for the truly paranoid