Re: Database backup and deduplication question
On 12/21/11 6:36 AM, Colin Percival wrote:
> On 12/20/11 10:45, Greg Larkin wrote:
>> On 12/20/11 12:05 AM, Colin Percival wrote:
>>> That's very weird. I just did my own test with two 100 MB files which
>>> were the same aside from their middle 50MB and I got the expected result
>>> (200 MB total archive size, 150 MB post-deduplication):
>>>
>> Ok, I just simplified everything and used the same dd/cat commands you
>> have above. My test is a little different in that I am only adding
>> single files to each archive, instead of multiple files with a lot of
>> common data to the same archive. I wonder if that has something to do
>> with it?
> I just tested with exactly the same sequence of commands as you and I'm
> still seeing 100 MB for the first archive and 50 MB for the second. Are
> you using the checkpoint-bytes option? That can affect the sequence of
> chunks which a block of data gets divided into, although it shouldn't
> have this much of an effect. Aside from that slight possibility, I'm
> utterly mystified... can you try
> # dd if=/dev/random bs=1m count=25 of=part1
> # dd if=/dev/random bs=1m count=50 of=part2
> # dd if=/dev/random bs=1m count=25 of=part3
> # cat part1 part2 part3 > file1
> # tarsnap --dry-run -c file1 part1 part2 part3
> so we can see if the individual parts are getting deduplicated properly
> when they're not stuck together with other data?
>
First, let me note that I'm running tarsnap 1.0.31 on Mac OS X 10.6.8,
installed via MacPorts (latest). The standard llvm-gcc-4.2 hangs during
the build, so this executable was compiled with clang:
sh-3.2# clang --version
Apple clang version 2.0 (tags/Apple/clang-137) (based on LLVM 2.9svn)
Target: x86_64-apple-darwin10
Thread model: posix
This probably doesn't matter: up until this morning I was using tarsnap
1.0.28, which was compiled with gcc-4.2, and the results were the same,
but I thought I would mention it. In both cases the SSE2 option is
enabled. I thought the old version might be the problem, but I don't
think so after running the following tests:
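(test.sh is just Colin's dd/cat/tarsnap sequence from above wrapped in a
script; roughly this:)

    #!/bin/sh
    # Build a 100 MB file from three random parts, then dry-run an archive
    # of the whole file plus the parts to see how much gets deduplicated.
    dd if=/dev/random bs=1m count=25 of=part1
    dd if=/dev/random bs=1m count=50 of=part2
    dd if=/dev/random bs=1m count=25 of=part3
    cat part1 part2 part3 > file1
    tarsnap --dry-run -cvf testarchive file1 part1 part2 part3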
sh-3.2# sh -x ./test.sh
+ dd if=/dev/random bs=1m count=25 of=part1
25+0 records in
25+0 records out
26214400 bytes transferred in 2.510754 secs (10440847 bytes/sec)
+ dd if=/dev/random bs=1m count=50 of=part2
50+0 records in
50+0 records out
52428800 bytes transferred in 5.199607 secs (10083223 bytes/sec)
+ dd if=/dev/random bs=1m count=25 of=part3
25+0 records in
25+0 records out
26214400 bytes transferred in 2.628373 secs (9973622 bytes/sec)
+ cat part1 part2 part3
+ tarsnap --dry-run -cvf testarchive file1 part1 part2 part3
a file1
a part1
a part2
a part3
                                       Total size  Compressed size
All archives                            209848583        210891785
  (unique data)                         138154492        138839900
This archive                            209848583        210891785
New data                                138154492        138839900
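To put those numbers in context, here's my back-of-the-envelope
arithmetic, assuming the three parts should dedup almost entirely
against file1:

    # 200 MiB of input: file1 (100 MiB) + part1/2/3 (25+50+25 MiB)
    echo $((209848583 - 209715200))   # total size over 200 MiB: ~130 KB of tar/metadata overhead
    echo $((138154492 - 104857600))   # unique data over an ideal ~100 MiB: ~33 MB that should have deduplicated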
I checked tarsnap.conf, and the default "checkpoint-bytes 1G" directive
is present.
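For reference, the relevant lines of my tarsnap.conf are roughly these
(the cachedir matches the path used below):

    cachedir /opt/local/tarsnap-cache
    checkpoint-bytes 1G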
I also realized that my cache directory was left over from previous
tests where I actually uploaded the random data, then nuked the
archives and ran --fsck once I had finished. Even so, there is still a
cache file in the cache directory, although I assume it's functionally
empty. Just for grins, I renamed the cache directory and ran the test
again without changing the data files:
sh-3.2# mv /opt/local/tarsnap-cache /opt/local/tarsnap-cache.SAVE
sh-3.2# tarsnap --dry-run -cvf testarchive file1 part1 part2 part3
Directory /opt/local/tarsnap-cache created for "--cachedir /opt/local/tarsnap-cache"
a file1
a part1
a part2
a part3
                                       Total size  Compressed size
All archives                            209848583        210891785
  (unique data)                         138154492        138839900
This archive                            209848583        210891785
New data                                138154492        138839900
Everything is identical to before, so the caching is consistent in that
respect.
Finally, I ran tarsnap multiple times, moving "file1" later in the
argument list each time. The unique data number was always virtually
the same as what's shown above, but interestingly, I did get different
"This archive/Total size" values:
file1 part1 part2 part3: 209848583
part1 file1 part2 part3: 209848623
part1 part2 file1 part3: 209848623
part1 part2 part3 file1: 209848583
Is that significant?
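In case it helps to reproduce, those four runs boil down to something
like this (a sketch of what I did by hand; the archive name is
arbitrary):

    # $args is deliberately left unquoted so it splits into separate file arguments.
    for args in "file1 part1 part2 part3" \
                "part1 file1 part2 part3" \
                "part1 part2 file1 part3" \
                "part1 part2 part3 file1"; do
        echo "=== $args"
        tarsnap --dry-run -cvf testarchive $args
    done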
Thank you,
Greg