Client-side deduplication during extraction


I'd like to retrieve and permanently archive (offline) a full set of
archives stored with one particular key using Tarsnap.

These are of course deduplicated at Tarsnap's end. But if I download
them one at a time (using something like "tarsnap --list-archives|xargs
tarsnap -r ..." for example), it'll cost me a ton of bandwidth, both at
my end, which is metered, and in Tarsnap's bandwidth charges.

I'd like my bandwidth bill to be the "Compressed size/(unique data)"
figure from --print-stats, not the "Compressed size/All archives"
figure. Since the redundancy is there and my client has all the details,
is there any way I can take advantage of this?
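
To put a number on the gap between those two figures, a rough sketch
along these lines could compare them. It assumes --print-stats prints
one row per label followed by two integer columns (total size, then
compressed size); the sample text and its byte counts are made up for
illustration, and the function names are hypothetical.

```python
import re

# Hypothetical sample in the layout assumed for `tarsnap --print-stats`.
# The byte counts are invented for illustration only.
SAMPLE = """\
                                       Total size  Compressed size
All archives                          50000000000      20000000000
  (unique data)                        5000000000       2000000000
"""


def compressed_sizes(stats_text):
    """Map each row label to its compressed size in bytes."""
    sizes = {}
    for line in stats_text.splitlines():
        # A data row is a label followed by two integer columns;
        # the header row has no integers, so it is skipped.
        m = re.match(r"^\s*(.+?)\s+(\d+)\s+(\d+)\s*$", line)
        if m:
            sizes[m.group(1)] = int(m.group(3))
    return sizes


def download_ratio(stats_text):
    """How many times larger the per-archive download is than the unique data."""
    sizes = compressed_sizes(stats_text)
    return sizes["All archives"] / sizes["(unique data)"]


if __name__ == "__main__":
    print(download_ratio(SAMPLE))
```

With the invented numbers above, fetching every archive separately
would transfer ten times the compressed unique data.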

If not, then I am planning to use a us-east-1 EC2 instance so that at
least the Tarsnap server<->client bandwidth is in one place. I can then
use that machine to deduplicate, so that the download to my machine here
can at least be efficient. In this case, will I still end up being
billed by Tarsnap for the "Compressed size/All archives" figure?