
Re: de-duplication detail

To: The Farmer <are.you.the.farmer@gmail.com>, tarsnap-users@tarsnap.com
Subject: Re: de-duplication detail
From: Arthur Chance <freebsd@qeng-ho.org>
Date: Wed, 07 May 2014 08:29:11 +0100
In-reply-to: <CAK=48n7b-CdA1H4qq-=96mvKkaTw3C-4zYE-f7EjTcLT5P2a=A@mail.gmail.com>
References: <CAK=48n7b-CdA1H4qq-=96mvKkaTw3C-4zYE-f7EjTcLT5P2a=A@mail.gmail.com>

On 07/05/2014 01:41, The Farmer wrote:

Is there any documentation on how de-duplication works?

From what I understand, each archive is split into chunks, which are hashed
then encrypted (or encrypted then hashed, perhaps) and uploaded to the
server, and two chunks with the same hash are considered duplicates, saving
an upload.

I'm wondering how the boundaries between chunks are established.  If my
first upload is chunked as (BC)(DEFG)(HIJ) and my 2nd (after inserting an A
at the start) as (AB)(CDEF)(GHIJ) then none of the chunks will be the same,
and the whole file will need to be uploaded again even though the change to
the file was tiny.

... so I guess that's not how it works, and I'm left wondering what kind of
cleverness is in play here.

I'm sure Colin will give you a more detailed answer, but I'm fairly
certain he's said it's a variant of the rsync algorithm. Take a look at

http://rsync.samba.org/tech_report/

for more details of the original.
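
To make the boundary question concrete, here is a minimal sketch of content-defined chunking, the general idea behind rsync-style rolling hashes. It is not Tarsnap's actual algorithm: the function name chunks, the constants WINDOW, MASK, BASE and MOD, and the choice of a polynomial rolling hash are all assumptions made for illustration. A boundary is placed wherever a hash of the last few bytes matches a fixed pattern, so boundaries follow the content rather than fixed offsets, and inserting a byte at the start only disturbs the chunks near the insertion.

```python
import hashlib
import random

# All constants below are illustrative assumptions, not Tarsnap's parameters.
WINDOW = 16            # number of trailing bytes that decide a boundary
MASK = (1 << 6) - 1    # cut when the low 6 hash bits are zero (~64-byte chunks)
BASE = 257             # polynomial base for the rolling hash
MOD = (1 << 61) - 1    # large prime modulus


def chunks(data: bytes) -> list:
    """Split data where a rolling hash of the last WINDOW bytes hits a pattern."""
    out = []
    start = 0
    h = 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, b in enumerate(data):
        if i >= WINDOW:
            # Drop the oldest byte so h always covers exactly WINDOW bytes.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + b) % MOD
        # Only cut once the window is full, so the decision depends purely
        # on the WINDOW bytes ending at position i.
        if i + 1 >= WINDOW and (h & MASK) == 0:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])  # trailing partial chunk
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    data = bytes(rng.randrange(256) for _ in range(4096))
    old = chunks(data)
    new = chunks(b"A" + data)  # same file with one byte inserted at the start
    seen = {hashlib.sha256(c).digest() for c in old}
    fresh = [c for c in new if hashlib.sha256(c).digest() not in seen]
    print(len(old), "chunks originally;", len(fresh), "would need uploading")
```

Because each cut depends only on the WINDOW bytes before it, every boundary in the original data reappears one byte later in the modified data. Only the chunk or two at the front change, and everything after the first surviving boundary hashes to a chunk that has already been stored.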

Follow-Ups:
- Re: de-duplication detail
  - From: Colin Percival <cperciva@tarsnap.com>

References:
- de-duplication detail
  - From: The Farmer <are.you.the.farmer@gmail.com>
