[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Tarsnap "de-duplication"



On 03/10/11 09:21, Richard wrote:
> On March,10, at 14.46, Colin Percival wrote:
>> Nope.  Tarsnap's variable-length blocks are generated in a data-dependent way;
>> the "chunkification" code (tar/multitape/chunkify.c) adds data to a block one
>> byte at a time, and after each block "tastes" the data to decide if it has
>> reached a good place to end the current block.  (Technically, it's looking for
>> regions where the partial sums of a power series repeat a value.)
> 
> If I may ask; why is such a repeat value seen as good end value, and how is the files binary data represented as a power series?

There's a table x[] which maps byte values to coefficients; a prime p, and a
value a between 2 and p-2.  So Tarsnap looks for positions i .. j such that
a^i * x[buf[i]] + a^(i+1) * x[buf[i+1]] + ... + a^j * x[buf[j]] mod p == 0.

> Based on the info that "chunkification" happens without consulting the list of already-existing blocks,
> I would assume that file changes might reorder the file to such a degree that it might be difficult to find
> already existing blocks from that point on? Wondering if that would lead to a situation where the data is
> already present, but chunked differently between blocks and thus not found?

Tarsnap's chunking is "convergent", in that if you modify the data stream, you
can get a few chunk boundaries which don't match up, but fairly quickly (average
about 1.6 chunks IIRC) the chunk boundaries will match up again.

> In an earlier mail 'Tarsnap "de-duplication"' (October 6, 2010), you wrote that: "Tarsnap uses variable-length blocks in order to solve this problem",
> but I have problems grasping how that is working if "chunkification" and checking list of already-existing blocks are divided into
> two separate tasks as described above?

The use of variable-length blocks makes it possible for the sequences of chunk
boundaries to converge.

-- 
Colin Percival
Security Officer, FreeBSD | freebsd.org | The power to serve
Founder / author, Tarsnap | tarsnap.com | Online backups for the truly paranoid