Re: Client-side deduplication during extraction

On Sun, Nov 19, 2017 at 8:03 PM, Colin Percival <cperciva@tarsnap.com> wrote:

On 11/19/17 12:37, Robie Basak wrote:
> On Sat, Apr 08, 2017 at 07:52:54PM -0700, Colin Percival wrote:
>> On 04/04/17 13:06, Robie Basak wrote:
>>> Since the redundancy is there and my client has all the details,
>>> is there any way I can take advantage of this?
>>
>> Not right now. This is something I've been thinking about implementing,
>> but it's rather complicated (the tarsnap "read" path would need to look at
>> data on disk to see what it can "reuse", and normally it doesn't read any
>> files from disk).
>
> In case it helps others, I hacked together a client-side cache for this
> one task. It appears to have worked. Patch below.

Ah yes, I was thinking in terms of "notice that we're extracting the file
'foo' and there is already a file 'foo', then read that file in and split
it into blocks in case any can be reused" -- the case you've covered here
of keeping a cache of downloaded blocks is much simpler (but only covers
the "multiple downloads of the same data" case, not the more general case
of "synchronizing" a system with an archive).
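
To make the simpler case concrete, here is a rough sketch in Python of a
cache of downloaded blocks. It is an illustration only: it is neither the
snipped patch nor tarsnap code, and the names BlockCache, get and
fetch_block are hypothetical. It assumes each block is identified by a
lowercase hex string and that the caller supplies a function which
downloads a block given that identifier.

```python
import os
import tempfile

HEX_DIGITS = "0123456789abcdef"


class BlockCache:
    """On-disk cache of downloaded blocks, keyed by a hex block identifier.

    Hypothetical sketch; not how tarsnap stores or names its blocks.
    """

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        # Create the cache directory up front, private to the current user.
        os.makedirs(cache_dir, mode=0o700, exist_ok=True)

    def _path(self, block_id):
        # Only accept hex identifiers so an ID can never escape cache_dir.
        if not block_id or any(ch not in HEX_DIGITS for ch in block_id):
            raise ValueError("block_id must be a lowercase hex string")
        return os.path.join(self.cache_dir, block_id)

    def get(self, block_id, fetch_block):
        """Return the block's bytes, calling fetch_block only on a miss."""
        path = self._path(block_id)
        try:
            with open(path, "rb") as f:
                return f.read()
        except FileNotFoundError:
            pass  # Cache miss: fall through and download.

        data = fetch_block(block_id)

        # Write to a temporary file and rename it into place, so that an
        # interrupted run does not leave a truncated block in the cache.
        fd, tmp_path = tempfile.mkstemp(dir=self.cache_dir)
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_path, path)
        return data
```

The second extraction of the same block is then served from disk rather
than downloaded again. This does nothing for the more general case, since
it never looks at files that are already present at the extraction target.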

> This is absolutely a hack and not production ready (no concurrency, bad
> error handling, hardcoded cache path whose directory must be created in
> advance and permissions set manually, etc.), but for a one-off task it
> was enough for me to get my data out.
> [snip patch]

Yes, this patch definitely looks like it does what you want. I'd consider
including it (well, with the details tidied up), but I'm not sure whether
anyone else would use this functionality. Is anyone else on the list
interested?

--
Colin Percival
Security Officer Emeritus, FreeBSD | The power to serve
Founder, Tarsnap | www.tarsnap.com | Online backups for the truly paranoid