
Re: Tarsnap "de-duplication"



On 03/10/11 06:20, Henrik Nordvik wrote:
> On Thu, Mar 10, 2011 at 2:46 PM, Colin Percival <cperciva@tarsnap.com> wrote:
>> Nope.  Tarsnap's variable-length blocks are generated in a data-dependent way;
>> the "chunkification" code (tar/multitape/chunkify.c) adds data to a block one
>> byte at a time, and after each byte "tastes" the data to decide if it has
>> reached a good place to end the current block.  (Technically, it's looking for
>> regions where the partial sums of a power series repeat a value.)
> 
> Have you done any testing to determine the improvement compared to,
> e.g., Rabin fingerprinting? (That method is used in a lot of other systems,
> e.g. [1]. Not strictly Rabin fingerprinting, but close enough.)

Tarsnap's chunkification is more robust in the presence of low-entropy data
streams; that is, it's less likely that you'll be "unlucky" and have a
frequently-occurring substring trigger the end of a chunk.

Aside from that, the performance is fairly similar.
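
To make the "unlucky" case concrete, here is a minimal sketch of a generic
rolling-hash chunker of the Rabin-fingerprint flavour. It is NOT the code in
tar/multitape/chunkify.c; the function name, window size, mask, base and
modulus are all assumptions picked for illustration. A chunk ends wherever
the hash of the last `window` bytes has all of its low bits set, so the cut
points depend only on nearby content.

```python
import random
from typing import List


def chunk_boundaries(data: bytes, window: int = 16, mask: int = 0x3F,
                     base: int = 257, mod: int = (1 << 31) - 1) -> List[int]:
    """Return chunk end offsets chosen by a polynomial rolling hash.

    Illustrative only: a generic content-defined chunker, not Tarsnap's.
    """
    top = pow(base, window - 1, mod)  # weight of the byte leaving the window
    h = 0
    cuts = []
    for i, b in enumerate(data):
        if i >= window:
            # Drop the oldest byte so h covers exactly the last `window` bytes.
            h = (h - data[i - window] * top) % mod
        h = (h * base + b) % mod
        # Cut when the window is full and the low bits of the hash are all 1.
        if i + 1 >= window and (h & mask) == mask:
            cuts.append(i + 1)
    # The last chunk always ends at the end of the input.
    if data and (not cuts or cuts[-1] != len(data)):
        cuts.append(len(data))
    return cuts


if __name__ == "__main__":
    rng = random.Random(0)
    data = bytes(rng.getrandbits(8) for _ in range(4096))
    cuts = chunk_boundaries(data)
    print("random data:", len(cuts), "chunks")

    # Low-entropy input 1: all zero bytes never satisfy the cut test.
    print("zeros:", chunk_boundaries(bytes(4096)))

    # Low-entropy input 2: repeat a 16-byte string that happens to trigger
    # a cut, and a chunk ends at (least at) every repetition.
    pattern = data[cuts[0] - 16:cuts[0]]
    print("repeated pattern:", len(chunk_boundaries(pattern * 50)), "chunks")
```

With a window-local test like this, a frequently-occurring substring that
happens to satisfy the cut condition produces a flood of tiny chunks, and a
constant run can produce no cuts at all; real implementations usually add
minimum and maximum chunk sizes to limit the damage.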

-- 
Colin Percival
Security Officer, FreeBSD | freebsd.org | The power to serve
Founder / author, Tarsnap | tarsnap.com | Online backups for the truly paranoid