Re: Database backup and deduplication question



Sorry about the late reply... I've been a bit busy with FreeBSD stuff the
past couple of days.

On 12/21/11 08:32, Greg Larkin wrote:
> First, let me note that I'm running tarsnap 1.0.31 on Mac OS X 10.6.8,
> and it was installed with MacPorts (latest).  The standard llvm-gcc-4.2
> hangs during the build, so this executable is compiled with clang:
> 
> sh-3.2# clang --version
> Apple clang version 2.0 (tags/Apple/clang-137) (based on LLVM 2.9svn)
> Target: x86_64-apple-darwin10
> Thread model: posix
> 
> This probably doesn't matter, because up until this morning, I was using
> tarsnap 1.0.28 that was compiled with gcc-4.2, and the results haven't
> changed, but I thought I would mention it. 

If compiling using gcc vs. clang made a difference, it would mean that there
was something very broken... I'm glad that didn't fix the problem. :-)

> In both cases, the SSE2 option is enabled.

That won't affect anything; SSE2 is only used in the scrypt (passphrased
key files) code.

> + tarsnap --dry-run -cvf testarchive file1 part1 part2 part3
> a file1
> a part1
> a part2
> a part3
>                                        Total size  Compressed size
> All archives                            209848583        210891785
>   (unique data)                         138154492        138839900
> This archive                            209848583        210891785
> New data                                138154492        138839900

Very weird.

> I checked tarsnap.conf, and the default "checkpoint-bytes 1G" directive
> is present.

Ok, that shouldn't affect anything -- with that value the checkpoint
creation code won't be triggered, since this archive is only about 210 MB.

> I also realized that my cache directory was left over from previous
> tests where I actually uploaded the random data, then nuked/fscked once
> I had finished.  Even still, there's a cache file in the cache
> directory, although I assume it's functionally empty.  Just for grins, I
> renamed the cache directory and ran the test again without changing the
> data files:
> [...]
> Everything is identical as before, so the caching is consistent in that
> respect.

Thanks, that rules out some possible explanations... doesn't tell me what
the actual problem is, though. :-/

> Finally, I ran tarsnap multiple times, moving "file1" later in the
> chain each time.  The unique data number was always virtually the same
> as what's shown above, but interestingly, I did receive different "This
> archive/total size" values:
> 
> file1 part1 part2 part3:    209848583
> part1 file1 part2 part3:    209848623
> part1 part2 file1 part3:    209848623
> part1 part2 part3 file1:    209848583
> 
> Is that significant?

That 40-byte difference is caused by the number of chunks changing by 1
(there's a 40-byte chunk header of "overhead" for each chunk).  Nothing
particularly surprising there.
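
To make that arithmetic explicit, here is a minimal shell sketch using two of
the sizes quoted above.  The variable names are only illustrative, and the
40-byte header size is simply the figure stated in this message.

```sh
#!/bin/sh
# "This archive" total sizes reported for two of the orderings above.
size_file1_first=209848583    # file1 part1 part2 part3
size_file1_second=209848623   # part1 file1 part2 part3

# Per-chunk header overhead, as stated in this message.
chunk_header_bytes=40

# The size difference divided by the header size gives the change in chunk count.
diff_bytes=$((size_file1_second - size_file1_first))
extra_chunks=$((diff_bytes / chunk_header_bytes))

echo "difference: $diff_bytes bytes = $extra_chunks extra chunk(s)"
```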

Can you try the following?

# tarsnap --dry-run -cvf testarchive file1 file1
# tarsnap --dry-run -cvf testarchive part1 part1
# tarsnap --dry-run -cvf testarchive part2 part2
# tarsnap --dry-run -cvf testarchive part3 part3

You should get a perfect 2:1 deduplication ratio (modulo overhead) storing the
same file twice... but of course you should have gotten a 2:1 ratio when storing
the file and its separate parts too, so I'd like to see if this works properly.
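
For comparison, the ratio in the dry-run output quoted earlier can be worked out
from the "This archive" and "New data" totals.  This is only a sketch with
illustrative variable names; it assumes a shell with 64-bit integer arithmetic
and scales the ratio by 1000 to keep three decimal places.

```sh
#!/bin/sh
# Uncompressed figures from the dry-run output quoted earlier in this message.
total_bytes=209848583     # "This archive", Total size
unique_bytes=138154492    # "New data", Total size

# Deduplication ratio scaled by 1000 (POSIX sh only has integer arithmetic).
ratio_x1000=$((total_bytes * 1000 / unique_bytes))

# With perfect deduplication of a file plus its parts, unique data would be
# about half of the total, ignoring per-chunk overhead.
expected_unique=$((total_bytes / 2))
excess_bytes=$((unique_bytes - expected_unique))

printf 'observed ratio: %d.%03d:1 (expected about 2:1)\n' \
    $((ratio_x1000 / 1000)) $((ratio_x1000 % 1000))
echo "unique data beyond the expected half: $excess_bytes bytes"
```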

-- 
Colin Percival
Security Officer, FreeBSD | freebsd.org | The power to serve
Founder / author, Tarsnap | tarsnap.com | Online backups for the truly paranoid