Re: de-duplication detail



On 05/07/14 00:29, Arthur Chance wrote:
> On 07/05/2014 01:41, The Farmer wrote:
>> Is there any documentation on how de-duplication works?
>>
>> From what I understand, each archive is split into chunks, which are hashed
>> then encrypted (or encrypted then hashed, perhaps) and uploaded to the
>> server, and two chunks with the same hash are considered duplicates, saving
>> an upload.
>>
>> I'm wondering how the boundaries between chunks are established.  If my
>> first upload is chunked as (BC)(DEFG)(HIJ) and my 2nd (after inserting an A
>> at the start) as (AB)(CDEF)(GHIJ) then none of the chunks will be the same,
>> and the whole file will need to be uploaded again even though the change to
>> the file was tiny.
>>
>> ... so I guess that's not how it works, and I'm left wondering what kind
>> of cleverness is in play here.

Tarsnap uses context-dependent block boundaries to avoid exactly this problem.
I talked about this as part of my EuroBSDCon 2013 talk -- see slides 7-12 of
http://www.daemonology.net/papers/EuroBSDCon13.pdf for some details.
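
To illustrate the general idea of context-dependent boundaries, here is a
minimal Python sketch.  It is not Tarsnap's actual algorithm: the rolling sum,
the WINDOW and MASK values, and the function names chunk() and digests() are
all assumptions made up for this example.  A chunk ends wherever the sum of
the last WINDOW bytes has its low bits all zero, so each boundary depends only
on nearby content and not on the byte's offset in the file.

```python
import hashlib
import random

WINDOW = 16   # bytes of context that decide a boundary (illustrative value)
MASK = 0x3F   # boundary when (rolling sum & MASK) == 0 (illustrative value)


def chunk(data):
    """Split data at content-defined boundaries using a simple rolling sum."""
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]      # drop the byte leaving the window
        # Only test once the window is full, so the decision depends purely
        # on the last WINDOW bytes of content.
        if i + 1 >= WINDOW and (rolling & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])          # trailing partial chunk
    return chunks


def digests(data):
    """Return the set of SHA-256 digests of the chunks of data."""
    return {hashlib.sha256(c).hexdigest() for c in chunk(data)}


# Demo: insert one byte at the start of some pseudo-random data.
rng = random.Random(0)
old = bytes(rng.randrange(256) for _ in range(4096))
new = b"A" + old

old_digests = digests(old)
new_digests = digests(new)
print("chunks in old file:      ", len(old_digests))
print("old chunks found in new: ", len(old_digests & new_digests))
```

Because every boundary after the first one is decided by content alone, the
inserted byte disturbs only the chunk(s) at the start of the file; the later
chunks hash to the same values and would not need to be uploaded again.  A
practical scheme would also want a stronger rolling hash and limits on minimum
and maximum chunk size, which this sketch leaves out.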

> I'm sure Colin will give you a more detailed answer but I'm fairly certain he's
> said it's a variant of the rsync algorithm. Take a look at
> 
> http://rsync.samba.org/tech_report/
> 
> for more details of the original.

No, not at all.  It is however related to the *rsyncable* option to gzip.

-- 
Colin Percival
Security Officer Emeritus, FreeBSD | The power to serve
Founder, Tarsnap | www.tarsnap.com | Online backups for the truly paranoid