[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Tarsnap "de-duplication"



On March,10, at 14.46, Colin Percival wrote:

> On 03/10/11 04:43, Richard wrote:
>> On 6. okt. 2010, at 15.18, Colin Percival wrote:
>>> Tarsnap uses variable-length blocks in order to solve this problem.  If you
>>> add or remove data in the middle of a file, the sequence of block boundaries
>>> will "resynchronize" quickly.
>> 
>> Since the fingerprints (md5, sha1, sha256 ?) of the blocks are all Tarsnap knows about the underlying data,
>> and Tarsnap therefore have no way of finding previous stored data by searching each of the files being backed up trying to find earlier patterns,
>> I assume that Tarnsnap has to do the following when reaching an area of the file that has not been backed up previously:
>> 	1) Move in one bit from last backed up block.
>> 	2) Grab one block of data (64 kB) from that point on into the file.
>> 	3) Fingerprint that block, and check if the block is stored.
>> 	4) Rinse and repeat until you find common ground again.
> 
> Nope.  Tarsnap's variable-length blocks are generated in a data-dependent way;
> the "chunkification" code (tar/multitape/chunkify.c) adds data to a block one
> byte at a time, and after each block "tastes" the data to decide if it has
> reached a good place to end the current block.  (Technically, it's looking for
> regions where the partial sums of a power series repeat a value.)

Thanks.
If I may ask; why is such a repeat value seen as good end value, and how is the files binary data represented as a power series?
If time allows: A tiny example of such a series would be great! :-)

> Tarsnap doesn't consult its list of already-existing blocks until after it has
> generated each possibly-new block.
Based on the info that "chunkification" happens without consulting the list of already-existing blocks,
I would assume that file changes might reorder the file to such a degree that it might be difficult to find
already existing blocks from that point on? Wondering if that would lead to a situation where the data is
already present, but chunked differently between blocks and thus not found?
In an earlier mail 'Tarsnap "de-duplication"' (October 6, 2010), you wrote that: "Tarsnap uses variable-length blocks in order to solve this problem",
but I have problems grasping how that is working if "chunkification" and checking list of already-existing blocks are divided into
two separate tasks as described above?

Thanks for info!

Richard Taubo