Re: Tarsnap "de-duplication"
On Thu, Mar 10, 2011 at 2:46 PM, Colin Percival <cperciva@tarsnap.com> wrote:
> Nope. Tarsnap's variable-length blocks are generated in a data-dependent way;
> the "chunkification" code (tar/multitape/chunkify.c) adds data to a block one
> byte at a time, and after each block "tastes" the data to decide if it has
> reached a good place to end the current block. (Technically, it's looking for
> regions where the partial sums of a power series repeat a value.)
>
> Tarsnap doesn't consult its list of already-existing blocks until after it has
> generated each possibly-new block.
Have you done any testing to determine how much of an improvement this is
compared to, e.g., Rabin fingerprinting? (That method is used in a lot of
other systems, e.g. [1]. Not strictly Rabin fingerprinting, but close enough.)
[1] http://pdos.csail.mit.edu/lbfs/
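
For concreteness, here is a rough sketch of the kind of rolling-hash
chunking I mean. It is neither Tarsnap's chunkify.c nor the exact LBFS
scheme: it uses a plain polynomial rolling hash instead of a true Rabin
fingerprint, and every name and constant in it (WINDOW, BASE, MOD, MASK,
MIN_SIZE, MAX_SIZE, chunk_boundaries, split_chunks) is made up for
illustration. The point is only that block ends are chosen from the data
itself, before any list of existing blocks is consulted.

```python
# Illustrative content-defined chunking with a polynomial rolling hash.
# All constants below are assumptions chosen for the example.
WINDOW = 48            # bytes covered by the sliding window
BASE = 257             # polynomial base
MOD = (1 << 61) - 1    # modulus for the hash arithmetic
MASK = (1 << 11) - 1   # cut when the low 11 bits are zero (~2 KiB apart)
MIN_SIZE = 256         # never cut before this many bytes (>= WINDOW)
MAX_SIZE = 16384       # always cut at this many bytes


def chunk_boundaries(data):
    """Return the end offset of every chunk in data."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the oldest byte in the window
    cuts = []
    start = 0  # offset where the current chunk began
    h = 0      # hash of the last (up to) WINDOW bytes of the current chunk
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            # Window is full: drop the byte that is sliding out of it.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        size = i - start + 1
        # "Taste" the data after each byte: end the chunk if the hash of
        # the trailing window looks right, or if the chunk is too long.
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            cuts.append(i + 1)
            start = i + 1
            h = 0
    if start < len(data):
        cuts.append(len(data))  # final partial chunk
    return cuts


def split_chunks(data):
    """Split data into chunks at the boundaries found above."""
    chunks = []
    prev = 0
    for end in chunk_boundaries(data):
        chunks.append(data[prev:end])
        prev = end
    return chunks
```

Because a cut depends only on the last WINDOW bytes, inserting a few bytes
near the start of a file should change the first chunk or two and leave
later boundaries where they were, so the later chunks are found again as
duplicates.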
--
Henrik Nordvik