
Re: Database backup and deduplication question



On 12/20/11 12:05 AM, Colin Percival wrote:
> On 12/19/11 14:59, Greg Larkin wrote:
>> I'm using tarsnap to back up some large MySQL database dump files, and
>> at the moment, they are compressed prior to backup.  I know that means
>> I'll have to push the maximum amount of data each day, so I'm looking to
>> reduce the time and bandwidth it takes to store each one.
> Right, compression (at least, if it isn't done with something like the
> 'rsyncable' option) breaks deduplication.
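
(For reference, the 'rsyncable' route on the dump side would look roughly like
the commands below.  The --rsyncable flag only exists in gzip builds patched
for it, and the database name, key path and archive name here are placeholders
rather than my real setup; the point is just that rsyncable output stays mostly
stable across small changes to the dump, so deduplication can still find the
common blocks.)

  mysqldump --single-transaction mydb | gzip --rsyncable > mydb.sql.gz
  tarsnap --keyfile /path/to/tarsnap.key -c -f mydb-20111220 mydb.sql.gz
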
>
>> I assume that if I generate an uncompressed full database backup file of
>> ~1GB each day and only ~5MB of the contents change, tarsnap recognizes
>> that and sends only the changed data.
> Assuming the 5MB is in large enough chunks, yes.  If you've got every 200th
> byte changing, tarsnap won't be able to find any blocks it recognizes even
> though the total number of bytes changing is small.
>
>> My questions are:
>>
>> 1) Does that assumption hold even if the filename changes each day?
> Yes, the deduplication is done across all the data stored using the same
> set of keys.
>
>> 2) Does that assumption hold if the filesystem inodes of the backup file
>> change each day?
> Yes, for the same reason.  (Having the name, inode #, size, or mtime change
> will result in Tarsnap reading the entire file from disk in order to look
> for changes, of course, but it won't upload bits it recognizes as having
> been previously uploaded.)
>
>> 3) Does tarsnap recognize what data to send if only a small amount in
>> the middle of the file changes?
> It should, yes.
>
>> I tested some scenarios by generating a 100MB file of random data.  I
>> tarsnapped several times, and obviously, the full file is only
>> transmitted the first time.  I then copied the file to a new name and
>> tarsnapped again.  The file is not transmitted because it's identical to
>> the original file, so it appears #1 and #2 are true.
>>
>> Next I changed the new file so I had 25MB of identical data at the
>> beginning of the file, 50MB of different data in the middle, and 25MB of
>> identical data at the end.  I tarsnapped again, and this time, I saw:
>>
>>                                        Total size  Compressed size
>> All archives                            524618980        527202985
>>   (unique data)                         183722213        184630527
>> deduptest5                              104924533        105446860
>>   (unique data)                          78795582         79187867
>>
>> Tarsnap reports 75MB of unique data in this archive, instead of 50MB.
> That's very weird.  I just did my own test with two 100 MB files which
> were the same aside from their middle 50MB and I got the expected result
> (200 MB total archive size, 150 MB post-deduplication):
>
> # dd if=/dev/urandom bs=1M count=25 of=part.1
> # dd if=/dev/urandom bs=1M count=50 of=part.2a
> # dd if=/dev/urandom bs=1M count=50 of=part.2b
> # dd if=/dev/urandom bs=1M count=25 of=part.3
> # cat part.1 part.2a part.3 > file.a
> # cat part.1 part.2b part.3 > file.b
> # tarsnap --keyfile tarsnap.key --dry-run -c -f foo file.a file.b
> [...]
> This archive                            209847044        210883514
> New data                                157539099        158316069
>
>> Is that due to the design of the chunking algorithm and expected
>> behavior?  If it is, is my best option to split the dump file into parts
>> that will likely remain static and the ones that will change more
>> frequently?
> Can you re-run the test to make sure that you really had 50 MB in common
> between your two files?
>
Ok, I just simplified everything and used the same dd/cat commands you
have above.  My test is a little different in that I am adding only a single
file to each archive, instead of putting multiple files with a lot of common
data into the same archive.  I wonder if that has something to do with it?
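
If it would help, I can also re-run it the single-archive way you did, along
the lines of the following (dry run, with the key path written as in your
example rather than my actual one):

  tarsnap --keyfile tarsnap.key --dry-run -c -f deduptest-both testfileA testfileB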

Here are the annotated results:

sh-3.2# dd if=/dev/random bs=1m count=25 of=part1
25+0 records in
25+0 records out
26214400 bytes transferred in 2.965546 secs (8839654 bytes/sec)
sh-3.2# dd if=/dev/random bs=1m count=25 of=part3
25+0 records in
25+0 records out
26214400 bytes transferred in 2.621968 secs (9997986 bytes/sec)
sh-3.2# dd if=/dev/random bs=1m count=50 of=part2a
50+0 records in
50+0 records out
52428800 bytes transferred in 5.334805 secs (9827688 bytes/sec)
sh-3.2# dd if=/dev/random bs=1m count=50 of=part2b
50+0 records in
50+0 records out
52428800 bytes transferred in 5.193277 secs (10095514 bytes/sec)
sh-3.2# ls -latr part*
-rw-r--r--  1 _unknown  _unknown  26214400 Dec 20 13:15 part1
-rw-r--r--  1 _unknown  _unknown  26214400 Dec 20 13:15 part3
-rw-r--r--  1 _unknown  _unknown  52428800 Dec 20 13:16 part2a
-rw-r--r--  1 _unknown  _unknown  52428800 Dec 20 13:16 part2b
sh-3.2#  md5 part*
MD5 (part1) = 82d4ebdf8ba67c848c24ed20b07109bf
MD5 (part2a) = 455f3a6133bd5e3b803ce528394984bd
MD5 (part2b) = 059e3e2fdc9d12e55c1c1495d5352f7b
MD5 (part3) = f49708b64884c7c63a7ab0cb6fdeb173
sh-3.2# cat part1 part2a part3 > testfileA
sh-3.2# cat part1 part2b part3 > testfileB

#
# Next commands are an attempt to establish that the middle 50MB of data
# is different in each file
#
sh-3.2# cmp -l testfileA testfileB | head
 26214401 336 112
 26214402 353 334
 26214403 150 356
 26214404 106  25
 26214405 347 155
 26214406 102   5
 26214407 115 355
 26214408 327 216
 26214409 355 353
 26214410 266 340
sh-3.2# cmp -l testfileA testfileB | tail
 78643191 320 362
 78643192 125  25
 78643193 364 142
 78643194 110 247
 78643195  47   4
 78643196 123 202
 78643197 325 266
 78643198 357 207
 78643199 160 354
 78643200 212 336
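
(A more direct check of the overlap, which I haven't pasted here, would be to
count the differing bytes:

  cmp -l testfileA testfileB | wc -l

Since cmp doesn't list positions where the two random bytes happen to match,
that should come out just under 52428800, roughly 255/256 of it.)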

#
# Check that files are 100MB
#
sh-3.2# wc testfile[AB]
  409507 3171742 104857600 testfileA
  408200 3172268 104857600 testfileB
  817707 6344010 209715200 total

#
# Start with a clean slate
#
sh-3.2# tarsnap --list-archives

#
# Add the first file to an archive
#
sh-3.2# tarsnap -cvf deduptest1 testfileA
a testfileA
                                       Total size  Compressed size
All archives                            104923493        105438015
  (unique data)                         104923493        105438015
This archive                            104923493        105438015
New data                                104923493        105438015

#
# Add the second file to a new archive - should be 50MB of changed data
# in the middle of the file, but the "New data" line shows 75MB is uploaded.
#
sh-3.2# tarsnap -cvf deduptest2 testfileB
a testfileB
                                       Total size  Compressed size
All archives                            209847586        210881190
  (unique data)                         183683097        184588451
This archive                            104924093        105443175
New data                                 78759604         79150436

#
# Create a third archive with the original file - minimal uploading as
# expected
#
sh-3.2# tarsnap -cvf deduptest3 testfileA
a testfileA
                                       Total size  Compressed size
All archives                            314771079        316319205
  (unique data)                         183683582        184589528
This archive                            104923493        105438015
New data                                      485             1077

#
# Create a fourth archive with the second file - minimal uploading as
# expected
#
sh-3.2# tarsnap -cvf deduptest4 testfileB
a testfileB
                                       Total size  Compressed size
All archives                            419695172        421762380
  (unique data)                         183684067        184590605
This archive                            104924093        105443175
New data                                      485             1077
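
Coming back to my earlier question about splitting the dump: what I had in mind
was dumping the mostly-static tables and the frequently-changing ones into
separate files, roughly like the sketch below (database and table names are
only placeholders), so that the static part can deduplicate almost entirely
from day to day even if the busy part doesn't:

  mysqldump --single-transaction mydb static_table_1 static_table_2 > mydb-static.sql
  mysqldump --single-transaction mydb busy_table_1 busy_table_2 > mydb-hot.sql
  tarsnap -c -f mydb-20111220 mydb-static.sql mydb-hot.sql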

Let me know if you have any questions or need any other information.

Thank you,
Greg