Re: Tarsnap "de-duplication"
On 03/10/11 06:20, Henrik Nordvik wrote:
> On Thu, Mar 10, 2011 at 2:46 PM, Colin Percival <cperciva@tarsnap.com> wrote:
>> Nope. Tarsnap's variable-length blocks are generated in a data-dependent way;
>> the "chunkification" code (tar/multitape/chunkify.c) adds data to a block one
>> byte at a time, and after each byte "tastes" the data to decide if it has
>> reached a good place to end the current block. (Technically, it's looking for
>> regions where the partial sums of a power series repeat a value.)
>
> Have you done any testing to determine the improvement compared
> to e.g. Rabin fingerprinting? (The method is used in a lot of other systems,
> e.g. [1]. Not strictly Rabin fingerprinting, but close enough.)
Tarsnap's chunkification is more robust in the presence of low-entropy data
streams; that is, it's less likely that you'll be "unlucky" and have a
frequently-occurring substring trigger the end of a chunk.
Aside from that, the performance is fairly similar.
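
To make the "unlucky" case concrete, here is a minimal sketch of a generic
rolling-hash chunker in Python. It is neither Tarsnap's chunkify.c nor true
Rabin fingerprinting; the function name chunkify and the constants WINDOW,
BASE, MOD, MASK, MIN_LEN and MAX_LEN are all made up for illustration. It
ends a chunk whenever the low bits of a polynomial hash over the last WINDOW
bytes are zero.

```python
from typing import List

# All constants below are illustrative assumptions, not values from Tarsnap.
WINDOW = 16          # bytes covered by the rolling hash
BASE = 257           # polynomial base
MOD = (1 << 31) - 1  # prime modulus
MASK = (1 << 8) - 1  # end a chunk when the low 8 bits of the hash are zero
MIN_LEN = 32         # never end a chunk before this many bytes
MAX_LEN = 4096       # always end a chunk at this many bytes


def chunkify(data: bytes) -> List[bytes]:
    """Split data at boundaries chosen by a hash of the last WINDOW bytes."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    chunks = []
    start = 0  # index where the current chunk begins
    h = 0      # rolling hash, reset at every chunk boundary
    for i, b in enumerate(data):
        length = i - start + 1
        if length > WINDOW:
            # Drop the byte that just slid out of the window.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + b) % MOD
        if (length >= MIN_LEN and (h & MASK) == 0) or length >= MAX_LEN:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


# A run of zero bytes hashes to zero at every position, so the boundary
# test fires as soon as MIN_LEN allows it.
zeros = chunkify(b"\x00" * 320)
print(len(zeros), {len(c) for c in zeros})  # -> 10 {32}
```

With this toy scheme a long run of zero bytes degenerates into minimum-size
chunks, because the repeated substring always satisfies the boundary test.
That is the kind of low-entropy input where a frequently-occurring substring
keeps ending chunks.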
--
Colin Percival
Security Officer, FreeBSD | freebsd.org | The power to serve
Founder / author, Tarsnap | tarsnap.com | Online backups for the truly paranoid