
Re: Client-side deduplication during extraction



Hi Robie,

On 04/04/17 13:06, Robie Basak wrote:
> I'd like to retrieve and permanently archive (offline) a full set of
> archives stored with one particular key using Tarsnap.
> 
> These are of course deduplicated at Tarsnap's end. But if I download
> them one at a time (using something like "tarsnap --list-archives|xargs
> tarsnap -r ..." for example), it'll cost me a ton of bandwidth - both at
> my end which is metered, and in Tarsnap's bandwidth charges.
> 
> I'd like my bandwidth bill to be the "Compressed size/(unique data)"
> figure from --print-stats, not the "Compressed size/All archives"
> figure. Since the redundancy is there and my client has all the details,
> is there any way I can take advantage of this?

Not right now.  This is something I've been thinking about implementing,
but it's rather complicated (the tarsnap "read" path would need to look at
data on disk to see what it can "reuse", and normally it doesn't read any
files from disk).
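
To illustrate what such "reuse" would involve, here is a rough sketch of
the idea only. It is not how tarsnap is implemented; the chunk IDs, the
archive-to-chunk mapping and the plan_downloads() function are all
hypothetical names made up for this example. Assuming the client knew
which chunks each archive refers to and which chunks it already holds
locally, it would only need to fetch each missing chunk once:

```python
def plan_downloads(archives, local_chunks):
    """Return the chunk IDs that still need fetching, each at most once.

    archives:     dict of archive name -> list of chunk IDs (hypothetical)
    local_chunks: set of chunk IDs already available on local disk
    """
    seen = set(local_chunks)   # chunks we never need to download again
    needed = []
    for name in sorted(archives):
        for chunk_id in archives[name]:
            if chunk_id not in seen:
                seen.add(chunk_id)
                needed.append(chunk_id)
    return needed


if __name__ == "__main__":
    archives = {
        "monday": ["c1", "c2"],
        "tuesday": ["c2", "c3"],   # shares c2 with monday
    }
    print(plan_downloads(archives, local_chunks={"c1"}))
```

The hard part is not this bookkeeping but finding out which chunks are
already on disk, which is why the read path would have to start reading
local files.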

> If not, then I am planning to use a us-east-1 EC2 instance so that at
> least the Tarsnap server<->client bandwidth is in one place. I can then
> use that machine to deduplicate and then the download to my machine here
> can at least be efficient. In this case, will I still end up being
> billed by Tarsnap for the "Compressed size/All archives" figure?

If you extract all of the archives, yes.

How are you planning on storing your data after you extract all of the
archives?  Something like ZFS which provides filesystem-level deduplication?
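
If the filesystem does not deduplicate for you, a much coarser
alternative is to look for identical files across the extracted trees
yourself. The sketch below is only an illustration under that
assumption: it works on whole files rather than blocks, and the function
names and directory layout are invented for the example. It groups
regular files by SHA-256 digest so that duplicates could later be
hard-linked or stored once:

```python
import hashlib
import os


def file_digest(path, bufsize=1 << 20):
    """Return the SHA-256 hex digest of a file, read in bufsize blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(bufsize), b""):
            h.update(block)
    return h.hexdigest()


def group_identical_files(root):
    """Map digest -> sorted list of paths, for digests seen more than once."""
    groups = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            groups.setdefault(file_digest(path), []).append(path)
    return {d: sorted(p) for d, p in groups.items() if len(p) > 1}


if __name__ == "__main__":
    # "extracted" is a hypothetical directory with one subtree per archive.
    for digest, paths in group_identical_files("extracted").items():
        print(digest[:12], paths)
```

This catches only byte-identical files; a file that changed slightly
between archives is stored in full each time, which block-level
deduplication would handle better.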

-- 
Colin Percival
Security Officer Emeritus, FreeBSD | The power to serve
Founder, Tarsnap | www.tarsnap.com | Online backups for the truly paranoid