
Re: de-duplication detail



On 07/05/2014 10:39, Colin Percival wrote:
On 05/07/14 00:29, Arthur Chance wrote:
On 07/05/2014 01:41, The Farmer wrote:
Is there any documentation on how de-duplication works?

From what I understand, each archive is split into chunks, which are hashed
then encrypted (or encrypted then hashed, perhaps) and uploaded to the
server, and two chunks with the same hash are considered duplicates, saving
an upload.

I'm wondering how the boundaries between chunks are established.  If my
first upload is chunked as (BC)(DEFG)(HIJ) and my 2nd (after inserting an A
at the start) as (AB)(CDEF)(GHIJ) then none of the chunks will be the same,
and the whole file will need to be uploaded again even though the change to
the file was tiny.
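
The concern can be reproduced with a minimal Python sketch. It assumes
fixed-size chunks of 3 bytes and SHA-256 as the chunk hash; both are
illustrative choices, not anything Tarsnap is stated to use, and the helper
names are hypothetical.

```python
import hashlib

def fixed_chunks(data: bytes, size: int) -> list:
    """Split data at fixed offsets, ignoring its content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def chunk_hashes(chunks) -> set:
    """Set of SHA-256 digests, one per distinct chunk."""
    return {hashlib.sha256(c).hexdigest() for c in chunks}

original = b"BCDEFGHIJ"
modified = b"A" + original          # one byte inserted at the start

old = chunk_hashes(fixed_chunks(original, 3))   # BCD, EFG, HIJ
new = chunk_hashes(fixed_chunks(modified, 3))   # ABC, DEF, GHI, J
shared = old & new

print(len(shared))   # 0: every boundary shifted, so nothing can be reused
```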

... so I guess that's not how it works, and I'm left wondering what kind of
cleverness is in play here.

Tarsnap uses context-dependent block boundaries to avoid exactly this problem.
I talked about this as part of my EuroBSDCon 2013 talk -- see slides 7-12 of
http://www.daemonology.net/papers/EuroBSDCon13.pdf for some details.
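
As a rough illustration of the general idea (not Tarsnap's actual algorithm,
which is described in the slides), the Python sketch below places a boundary
wherever a rolling sum over the last few bytes hits a chosen value. The
window size, divisor, rolling sum and function names are all assumptions
made for the example. Because each boundary depends only on nearby content,
inserting a byte at the start moves the boundaries along with the data.

```python
import hashlib
import random

WINDOW = 16    # bytes of context that decide a boundary (assumed value)
DIVISOR = 64   # roughly one boundary per 64 bytes on random data (assumed)

def content_defined_chunks(data: bytes) -> list:
    """Cut after byte i when the sum of the last WINDOW bytes matches."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]      # slide the window forward
        # Only full windows may trigger a boundary.
        if i >= WINDOW - 1 and rolling % DIVISOR == DIVISOR - 1:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])          # trailing partial chunk
    return chunks

def chunk_hashes(chunks) -> set:
    return {hashlib.sha256(c).hexdigest() for c in chunks}

rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4096))
modified = b"A" + original                   # one byte inserted at the start

old_hashes = chunk_hashes(content_defined_chunks(original))
new_hashes = chunk_hashes(content_defined_chunks(modified))
shared = old_hashes & new_hashes

print(f"{len(shared)} of {len(old_hashes)} chunks would not need re-uploading")
```

Only the chunk containing the inserted byte can differ; every later chunk
is byte-for-byte identical and so hashes to the same value. A real chunker
would use a stronger rolling hash and enforce minimum and maximum chunk
sizes, which this sketch omits.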

I'm sure Colin will give you a more detailed answer but I'm fairly certain he's
said it's a variant of the rsync algorithm. Take a look at

http://rsync.samba.org/tech_report/

for more details of the original.

No, not at all.  It is however related to the *rsyncable* option to gzip.


Ah, confused the two. At my age memory isn't what it, umm, thingy.