Re: Database backup and deduplication question
Sorry about the late reply... I've been a bit busy with FreeBSD stuff the
past couple of days.
On 12/21/11 08:32, Greg Larkin wrote:
> First, let me note that I'm running tarsnap 1.0.31 on Mac OS X 10.6.8,
> and it was installed with MacPorts (latest). The standard llvm-gcc-4.2
> hangs during the build, so this executable is compiled with clang:
>
> sh-3.2# clang --version
> Apple clang version 2.0 (tags/Apple/clang-137) (based on LLVM 2.9svn)
> Target: x86_64-apple-darwin10
> Thread model: posix
>
> This probably doesn't matter, because up until this morning, I was using
> tarsnap 1.0.28 that was compiled with gcc-4.2, and the results haven't
> changed, but I thought I would mention it.
If compiling using gcc vs. clang made a difference, it would mean that there
was something very broken... I'm glad that didn't fix the problem. :-)
> In both cases, the SSE2 option is enabled.
That won't affect anything; SSE2 is only used in the scrypt (passphrased
key files) code.
> + tarsnap --dry-run -cvf testarchive file1 part1 part2 part3
> a file1
> a part1
> a part2
> a part3
>                        Total size  Compressed size
> All archives            209848583        210891785
>   (unique data)         138154492        138839900
> This archive            209848583        210891785
> New data                138154492        138839900
Very weird.
> I checked tarsnap.conf, and the default "checkpoint-bytes 1G" directive
> is present.
Ok, that shouldn't affect anything -- with that value the checkpoint
creation code won't be triggered.
> I also realized that my cache directory was left over from previous
> tests where I actually uploaded the random data, then nuked/fscked once
> I had finished. Even still, there's a cache file in the cache
> directory, although I assume it's functionally empty. Just for grins, I
> renamed the cache directory and ran the test again without changing the
> data files:
> [...]
> Everything is identical as before, so the caching is consistent in that
> respect.
Thanks, that rules out some possible explanations... doesn't tell me what
the actual problem is, though. :-/
> Finally, I ran tarsnap multiple times, moving "file1" later in the
> chain each time. The unique data number was always virtually the same
> as what's shown above, but interestingly, I did receive different "This
> archive/total size" values:
>
> file1 part1 part2 part3: 209848583
> part1 file1 part2 part3: 209848623
> part1 part2 file1 part3: 209848623
> part1 part2 part3 file1: 209848583
>
> Is that significant?
That 40 byte difference is caused by the number of chunks changing by 1
(there's a 40 byte chunk header of "overhead" for each chunk). Nothing
particularly surprising there.
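
As a quick sanity check of that arithmetic, here is a small Python sketch
using only the four sizes quoted above and the 40 byte header figure. It
assumes the whole size difference is chunk-header overhead; the variable
names are made up for illustration and are not anything from tarsnap.

```python
# Per-chunk header size, as stated in the reply above.
CHUNK_HEADER_BYTES = 40

# "This archive / total size" values reported for each argument order.
reported_sizes = {
    "file1 part1 part2 part3": 209848583,
    "part1 file1 part2 part3": 209848623,
    "part1 part2 file1 part3": 209848623,
    "part1 part2 part3 file1": 209848583,
}

baseline = min(reported_sizes.values())

# Extra chunks relative to the smallest run, assuming (hypothetically) that
# the entire size difference is chunk-header overhead.
extra_chunks = {}
for order, size in reported_sizes.items():
    delta = size - baseline
    chunks, remainder = divmod(delta, CHUNK_HEADER_BYTES)
    extra_chunks[order] = (chunks, remainder)
    print(f"{order}: +{delta} bytes, {chunks} extra chunk(s), remainder {remainder}")
```

A zero remainder in every row is consistent with the difference being a
whole number of chunk headers (here, exactly one).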
Can you try
# tarsnap --dry-run -cvf testarchive file1 file1
# tarsnap --dry-run -cvf testarchive part1 part1
# tarsnap --dry-run -cvf testarchive part2 part2
# tarsnap --dry-run -cvf testarchive part3 part3
You should get a perfect 2:1 deduplication ratio (modulo overhead) storing the
same file twice... but of course you should have gotten a 2:1 ratio when storing
the file and its separate parts too, so I'd like to see if this works properly.
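
For reference, a rough Python sketch of how far the numbers quoted earlier are
from that 2:1 figure. It uses only the "Total size" column from the output
above and assumes file1 is exactly the concatenation of part1..part3, so that
ideal deduplication would leave half the data as unique; overhead is ignored
and the variable names are illustrative only.

```python
# Values from the "Total size" column of the --dry-run output quoted above.
total_size = 209848583   # "This archive"
unique_size = 138154492  # "New data"

# Deduplication ratio actually observed.
observed_ratio = total_size / unique_size

# Assumption: file1 == part1 + part2 + part3, so with perfect deduplication
# only half of the archive would count as unique data (ignoring overhead).
ideal_unique = total_size / 2
excess_unique = unique_size - ideal_unique

print(f"observed ratio: {observed_ratio:.3f}:1 (expected about 2:1)")
print(f"unique data beyond the ideal: {excess_unique:.0f} bytes")
```

The same two lines of arithmetic can be applied to the output of each of the
four test commands above to see which inputs fall short of 2:1.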
--
Colin Percival
Security Officer, FreeBSD | freebsd.org | The power to serve
Founder / author, Tarsnap | tarsnap.com | Online backups for the truly paranoid