
Re: Client-side deduplication during extraction



On Sat, Apr 08, 2017 at 07:52:54PM -0700, Colin Percival wrote:
> On 04/04/17 13:06, Robie Basak wrote:
> > I'd like to retrieve and permanently archive (offline) a full set of
> > archives stored with one particular key using Tarsnap.
> > 
> > These are of course deduplicated at Tarsnap's end. But if I download
> > them one at a time (using something like "tarsnap --list-archives|xargs
> > tarsnap -r ..." for example), it'll cost me a ton of bandwidth - both at
> > my end which is metered, and in Tarsnap's bandwidth charges.
> > 
> > I'd like my bandwidth bill to be the "Compressed size/(unique data)"
> > figure from --print-stats, not the "Compressed size/All archives"
> > figure. Since the redundancy is there and my client has all the details,
> > is there any way I can take advantage of this?
> 
> Not right now.  This is something I've been thinking about implementing,
> but it's rather complicated (the tarsnap "read" path would need to look at
> data on disk to see what it can "reuse", and normally it doesn't read any
> files from disk).

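In case it helps others, I hacked together a client-side cache for this
one task. It stores every block that storage_read_file() fetches in a
file named after the block's class and hash, and serves later reads of
the same block from that file instead of the network. It appears to have
worked. Patch below.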

This is absolutely a hack and not production-ready (no concurrency, bad
error handling, and a hardcoded cache path whose directory must be
created in advance with permissions set manually), but for a one-off
task it was enough for me to get my data out.
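
For readers who want the idea without the tarsnap source tree, here is a
minimal standalone sketch of the same read-through pattern. It is not
tarsnap code: cached_read, hex_name, fetch_fn and demo_fetch are
hypothetical names made up for illustration, and the fetch callback
merely stands in for the network read. Unlike the patch, the sketch
treats caching as best-effort and refetches on a short cache entry
rather than aborting.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical callback standing in for the network read. */
typedef int (*fetch_fn)(const uint8_t name[32], uint8_t *buf, size_t buflen);

/* Write the 32-byte name as 64 lowercase hex digits plus a NUL. */
static void
hex_name(const uint8_t name[32], char out[65])
{
	static const char digits[] = "0123456789abcdef";
	size_t i;

	for (i = 0; i < 32; i++) {
		out[2 * i] = digits[name[i] >> 4];
		out[2 * i + 1] = digits[name[i] & 0x0f];
	}
	out[64] = '\0';
}

/* Read-through cache: returns 0 on success, -1 on failure. */
static int
cached_read(const char *dir, char class, const uint8_t name[32],
    uint8_t *buf, size_t buflen, fetch_fn fetch)
{
	char hex[65];
	char path[1024];
	FILE *fp;
	size_t got;
	int n;

	assert(buflen > 0);
	hex_name(name, hex);
	n = snprintf(path, sizeof(path), "%s/%c-%s", dir, class, hex);
	if (n < 0 || (size_t)n >= sizeof(path))
		return (-1);

	/* Cache hit: serve the block from disk and skip the fetch. */
	if ((fp = fopen(path, "rb")) != NULL) {
		got = fread(buf, 1, buflen, fp);
		fclose(fp);
		if (got == buflen)
			return (0);
		/* Short entry: fall through and fetch again. */
	}

	/* Cache miss: fetch the block, then keep a copy for later reads. */
	if (fetch(name, buf, buflen) != 0)
		return (-1);
	if ((fp = fopen(path, "wb")) == NULL)
		return (0);	/* Caching is best-effort in this sketch. */
	if (fwrite(buf, 1, buflen, fp) != buflen) {
		fclose(fp);
		remove(path);
		return (0);
	}
	if (fclose(fp) != 0)
		remove(path);
	return (0);
}

/* Demo fetcher: fills the buffer with a fixed byte and counts calls. */
static int fetch_calls;

static int
demo_fetch(const uint8_t name[32], uint8_t *buf, size_t buflen)
{
	(void)name;
	fetch_calls++;
	memset(buf, 0xab, buflen);
	return (0);
}
```

Calling cached_read() twice with the same directory, class and name
should invoke demo_fetch only once; the second call is answered from the
file written by the first.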

---
 tar/storage/storage_read.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/tar/storage/storage_read.c b/tar/storage/storage_read.c
index 2c19650..62bf6b7 100644
--- a/tar/storage/storage_read.c
+++ b/tar/storage/storage_read.c
@@ -13,6 +13,7 @@
 #include "storage_internal.h"
 #include "sysendian.h"
 #include "warnp.h"
+#include "hexify.h"
 
 #include "storage.h"
 
@@ -313,6 +314,20 @@ storage_read_file(STORAGE_R * S, uint8_t * buf, size_t buflen,
 		}
 	}
 
+	int old_errno = errno;
+	char hashbuf[65];
+	hexify(name, hashbuf, 32);
+	char *cache_path;
+	if(asprintf(&cache_path, "/tmp/tarsnap-cache/%c-%s", class, hashbuf) < 0) abort();
+	FILE *fp = fopen(cache_path, "r");
+	if (fp) {
+	    if (fread(buf, buflen, 1, fp) != 1) abort();
+	    if (fclose(fp)) abort();
+	    free(cache_path);
+	    return 0;
+	} else {
+	    errno = old_errno;
+	}
 	/* Initialize structure. */
 	C.buf = buf;
 	C.buflen = buflen;
@@ -326,6 +341,13 @@ storage_read_file(STORAGE_R * S, uint8_t * buf, size_t buflen,
 		goto err0;
 
 done:
+	if (!C.status) {
+		FILE *fp = fopen(cache_path, "w");
+		if (!fp) abort();
+		if(fwrite(buf, buflen, 1, fp) != 1) abort();
+		if(fclose(fp)) abort();
+	}
+	free(cache_path);
 	/* Return status code from server. */
 	return (C.status);
 
-- 
2.7.4
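
Two of the limitations mentioned before the patch (the cache directory
having to exist in advance, and no protection against a reader seeing a
half-written entry) could probably be reduced along the following lines.
This is only a sketch under assumptions: cache_dir_init and cache_store
are hypothetical helpers, not tarsnap functions, and the code relies on
POSIX mkdir(2), rename(2) and getpid(2).

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create the cache directory, private to the user, if it is missing. */
static int
cache_dir_init(const char *dir)
{
	struct stat sb;

	if (mkdir(dir, 0700) == 0)
		return (0);
	if (errno != EEXIST)
		return (-1);
	/* Something already has that name: accept it only if a directory. */
	if (stat(dir, &sb) != 0 || !S_ISDIR(sb.st_mode))
		return (-1);
	return (0);
}

/*
 * Store an entry by writing to a temporary name and renaming it into
 * place, so a reader finds either no entry or a complete one.
 * Returns 0 on success, -1 on failure.
 */
static int
cache_store(const char *path, const void *buf, size_t buflen)
{
	char tmp[1024];
	FILE *fp;
	int n;

	assert(path != NULL && buf != NULL);
	n = snprintf(tmp, sizeof(tmp), "%s.tmp.%ld", path, (long)getpid());
	if (n < 0 || (size_t)n >= sizeof(tmp))
		return (-1);
	if ((fp = fopen(tmp, "wb")) == NULL)
		return (-1);
	if (fwrite(buf, 1, buflen, fp) != buflen) {
		fclose(fp);
		goto fail;
	}
	if (fclose(fp) != 0)
		goto fail;
	if (rename(tmp, path) != 0)
		goto fail;
	return (0);

fail:
	remove(tmp);
	return (-1);
}
```

Because the temporary name includes the process ID, two processes
writing the same entry would not share a temporary file; whichever
rename happens last wins, which is harmless when both wrote the same
block.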