Re: Tarsnap "de-duplication"



On Thu, Mar 10, 2011 at 2:46 PM, Colin Percival <cperciva@tarsnap.com> wrote:
> Nope.  Tarsnap's variable-length blocks are generated in a data-dependent way;
> the "chunkification" code (tar/multitape/chunkify.c) adds data to a block one
> byte at a time, and after each block "tastes" the data to decide if it has
> reached a good place to end the current block.  (Technically, it's looking for
> regions where the partial sums of a power series repeat a value.)
>
> Tarsnap doesn't consult its list of already-existing blocks until after it has
> generated each possibly-new block.

Have you done any testing to determine how much of an improvement this
is compared to, e.g., Rabin fingerprinting? (That method is used in a lot
of other systems, e.g. [1]. Not strictly Rabin fingerprinting, but close enough.)

[1] http://pdos.csail.mit.edu/lbfs/
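
To make the comparison concrete, here is a rough sketch of the kind of
rolling-hash, content-defined chunking meant above. It is NOT Tarsnap's
chunkify.c algorithm and NOT the actual fingerprint code from [1]: it
substitutes a plain polynomial rolling hash for a true Rabin fingerprint,
and every name and constant in it (chunk_boundaries, WINDOW, BASE, MOD,
MASK, MIN_SIZE, MAX_SIZE) is an assumption made up for illustration.

```python
# Illustrative sketch only: polynomial rolling hash, not a real Rabin
# fingerprint and not Tarsnap's method. All constants are assumed values.
WINDOW = 48            # bytes covered by the sliding window
BASE = 257             # polynomial base
MOD = (1 << 61) - 1    # modulus for the hash arithmetic
MASK = (1 << 11) - 1   # cut when the low 11 bits are zero (~2 KiB apart)
MIN_SIZE = 256         # never cut a chunk shorter than this
MAX_SIZE = 16384       # always cut once a chunk reaches this size


def chunk_boundaries(data):
    """Return the end offsets of the chunks that `data` is split into."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the oldest byte in the window
    boundaries = []
    start = 0  # offset where the current chunk begins
    h = 0      # hash of the last (up to) WINDOW bytes of the current chunk
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            # Window is full: drop the byte that is sliding out of it.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        size = i - start + 1
        # "Taste" the data after every byte: end the chunk when the hash
        # hits the target pattern, or when the chunk has grown too large.
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            boundaries.append(i + 1)
            start = i + 1
            h = 0
    if start < len(data):
        boundaries.append(len(data))  # trailing partial chunk
    return boundaries
```

Because a cut depends only on the bytes inside the window (plus the size
limits), inserting data near the start of a file should move only the
nearby boundaries, and later chunks come out byte-identical, so a lookup
against the list of existing blocks can still match them.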

--
Henrik Nordvik