Re: Database backup and deduplication question
On 12/20/11 12:05 AM, Colin Percival wrote:
> On 12/19/11 14:59, Greg Larkin wrote:
>> I'm using tarsnap to back up some large MySQL database dump files, and
>> at the moment, they are compressed prior to backup. I know that means
>> I'll have to push the maximum amount of data each day, so I'm looking to
>> reduce the time and bandwidth it takes to store each one.
> Right, compression (at least, if it isn't done with something like the
> 'rsyncable' option) breaks deduplication.
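For reference, this points me toward dropping the pre-compression and letting
tarsnap compress after deduplication. A rough sketch of the nightly job I have
in mind (database name and paths are placeholders):

# Dump uncompressed so tarsnap's chunking can match unchanged blocks;
# tarsnap compresses the data itself before uploading.
mysqldump --single-transaction mydb > /backups/mydb.sql
tarsnap -c -f "mydb-$(date +%Y%m%d)" /backups/mydb.sql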
>
>> I assume that if I generate an uncompressed full database backup file of
>> ~1GB each day and only ~5MB of the contents change, tarsnap recognizes
>> that and sends only the changed data.
> Assuming the 5MB is in large enough chunks, yes. If you've got every 200th
> byte changing, tarsnap won't be able to find any blocks it recognizes even
> though the total number of bytes changing is small.
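That makes sense - before switching over, I'll probably do a dry run against a
real pair of day-over-day dumps to see how much tarsnap would actually send
(key file and dump paths are placeholders):

# --dry-run uploads nothing; the stats show how much data would be new
tarsnap --keyfile /root/tarsnap.key --dry-run --print-stats \
    -c -f dumptest /backups/mydb.sql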
>
>> My questions are:
>>
>> 1) Does that assumption hold even if the filename changes each day?
> Yes, the deduplication is done across all the data stored using the same
> set of keys.
>
>> 2) Does that assumption hold if the filesystem inodes of the backup file
>> change each day?
> Yes, for the same reason. (Having the name, inode #, size, or mtime change
> will result in Tarsnap reading the entire file from disk in order to look
> for changes, of course, but it won't upload bits it recognizes as having
> been previously uploaded.)
>
>> 3) Does tarsnap recognize what data to send if only a small amount in
>> the middle of the file changes?
> It should, yes.
>
>> I tested some scenarios by generating a 100MB file of random data. I
>> tarsnapped several times, and obviously, the full file is only
>> transmitted the first time. I then copied the file to a new name and
>> tarsnapped again. The file is not transmitted because it's identical to
>> the original file, so it appears #1 and #2 are true.
>>
>> Next I changed the new file so I had 25MB of identical data at the
>> beginning of the file, 50MB of different data in the middle, and 25MB of
>> identical data at the end. I tarsnapped again, and this time, I saw:
>>
>>                 Total size  Compressed size
>> All archives     524618980        527202985
>> (unique data)    183722213        184630527
>> deduptest5       104924533        105446860
>> (unique data)     78795582         79187867
>>
>> Tarsnap reports 75MB of unique data in this archive, instead of 50MB.
> That's very weird. I just did my own test with two 100 MB files which
> were the same aside from their middle 50MB and I got the expected result
> (200 MB total archive size, 150 MB post-deduplication):
>
> # dd if=/dev/urandom bs=1M count=25 of=part.1
> # dd if=/dev/urandom bs=1M count=50 of=part.2a
> # dd if=/dev/urandom bs=1M count=50 of=part.2b
> # dd if=/dev/urandom bs=1M count=25 of=part.3
> # cat part.1 part.2a part.3 > file.a
> # cat part.1 part.2b part.3 > file.b
> # tarsnap --keyfile tarsnap.key --dry-run -c -f foo file.a file.b
> [...]
> This archive     209847044        210883514
> New data         157539099        158316069
>
>> Is that due to the design of the chunking algorithm and expected
>> behavior? If it is, is my best option to split the dump file into parts
>> that will likely remain static and the ones that will change more
>> frequently?
> Can you re-run the test to make sure that you really had 50 MB in common
> between your two files?
>
Ok, I just simplified everything and used the same dd/cat commands you
have above. My test is a little different in that I add only a single
file to each archive, instead of putting multiple files with a lot of
common data into the same archive. I wonder if that has something to do
with it?
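(If it would help narrow things down, I can also repeat it your way, with both
files in one archive - e.g.

tarsnap --dry-run --print-stats -c -f deduptest-combined testfileA testfileB

- but the numbers below are from the one-file-per-archive runs.)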
Here are the annotated results:
sh-3.2# dd if=/dev/random bs=1m count=25 of=part1
25+0 records in
25+0 records out
26214400 bytes transferred in 2.965546 secs (8839654 bytes/sec)
sh-3.2# dd if=/dev/random bs=1m count=25 of=part3
25+0 records in
25+0 records out
26214400 bytes transferred in 2.621968 secs (9997986 bytes/sec)
sh-3.2# dd if=/dev/random bs=1m count=50 of=part2a
50+0 records in
50+0 records out
52428800 bytes transferred in 5.334805 secs (9827688 bytes/sec)
sh-3.2# dd if=/dev/random bs=1m count=50 of=part2b
50+0 records in
50+0 records out
52428800 bytes transferred in 5.193277 secs (10095514 bytes/sec)
sh-3.2# ls -latr part*
-rw-r--r-- 1 _unknown _unknown 26214400 Dec 20 13:15 part1
-rw-r--r-- 1 _unknown _unknown 26214400 Dec 20 13:15 part3
-rw-r--r-- 1 _unknown _unknown 52428800 Dec 20 13:16 part2a
-rw-r--r-- 1 _unknown _unknown 52428800 Dec 20 13:16 part2b
sh-3.2# md5 part*
MD5 (part1) = 82d4ebdf8ba67c848c24ed20b07109bf
MD5 (part2a) = 455f3a6133bd5e3b803ce528394984bd
MD5 (part2b) = 059e3e2fdc9d12e55c1c1495d5352f7b
MD5 (part3) = f49708b64884c7c63a7ab0cb6fdeb173
sh-3.2# cat part1 part2a part3 > testfileA
sh-3.2# cat part1 part2b part3 > testfileB
#
# Next commands are an attempt to establish that the middle 50MB of data
# is different in each file
#
sh-3.2# cmp -l testfileA testfileB | head
26214401 336 112
26214402 353 334
26214403 150 356
26214404 106 25
26214405 347 155
26214406 102 5
26214407 115 355
26214408 327 216
26214409 355 353
26214410 266 340
sh-3.2# cmp -l testfileA testfileB | tail
78643191 320 362
78643192 125 25
78643193 364 142
78643194 110 247
78643195 47 4
78643196 123 202
78643197 325 266
78643198 357 207
78643199 160 354
78643200 212 336
#
# Check that files are 100MB
#
sh-3.2# wc testfile[AB]
409507 3171742 104857600 testfileA
408200 3172268 104857600 testfileB
817707 6344010 209715200 total
#
# Start with a clean slate
#
sh-3.2# tarsnap --list-archives
#
# Add the first file to an archive
#
sh-3.2# tarsnap -cvf deduptest1 testfileA
a testfileA
                Total size  Compressed size
All archives     104923493        105438015
(unique data)    104923493        105438015
This archive     104923493        105438015
New data         104923493        105438015
#
# Add the second file to a new archive - should be 50MB of changed data
# in the middle of the file, but the "New data" line shows 75MB is uploaded.
#
sh-3.2# tarsnap -cvf deduptest2 testfileB
a testfileB
                Total size  Compressed size
All archives     209847586        210881190
(unique data)    183683097        184588451
This archive     104924093        105443175
New data          78759604         79150436
#
# Create a third archive with the original file - minimal uploading as
# expected
#
sh-3.2# tarsnap -cvf deduptest3 testfileA
a testfileA
                Total size  Compressed size
All archives     314771079        316319205
(unique data)    183683582        184589528
This archive     104923493        105438015
New data               485             1077
#
# Create a fourth archive with the second file - minimal uploading as
# expected
#
sh-3.2# tarsnap -cvf deduptest4 testfileB
a testfileB
                Total size  Compressed size
All archives     419695172        421762380
(unique data)    183684067        184590605
This archive     104924093        105443175
New data               485             1077
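On the earlier question about splitting the dump: if the chunking turns out to
be the limiting factor, one approach I'm considering is one dump file per
table, so tables that rarely change produce identical files from day to day.
Roughly (database name and paths are placeholders):

# One file per table; unchanged tables dedupe as whole files.
for t in $(mysql -N -e 'show tables' mydb); do
    mysqldump --single-transaction mydb "$t" > "/backups/mydb-$t.sql"
done
tarsnap -c -f "mydb-$(date +%Y%m%d)" /backups/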
Let me know if you have any questions or need any other information.
Thank you,
Greg