Re: Database backup and deduplication question
On 12/19/11 14:59, Greg Larkin wrote:
> I'm using tarsnap to back up some large MySQL database dump files, and
> at the moment, they are compressed prior to backup. I know that means
> I'll have to push the maximum amount of data each day, so I'm looking to
> reduce the time and bandwidth it takes to store each one.
Right, compression (at least, if it isn't done with something like the
'rsyncable' option) breaks deduplication.
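If the dumps have to stay compressed, something along these lines keeps them
friendlier to deduplication (a sketch; 'mydb' is a placeholder database name,
and --rsyncable is only present in gzip builds which carry the rsyncable
patch, e.g. Debian's):
# Dedup-hostile: a small change early in the dump typically changes all
# of the compressed output after it.
mysqldump mydb | gzip > dump.sql.gz

# Dedup-friendlier: gzip periodically resets its compression state, so
# unchanged stretches of the dump compress to identical output.
mysqldump mydb | gzip --rsyncable > dump.sql.gz

# Friendliest: store the dump uncompressed and let tarsnap compress the
# deduplicated blocks itself.
mysqldump mydb > dump.sql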
> I assume that if I generate a uncompressed full database backup file of
> ~1GB each day and only ~5MB of the contents change, tarsnap recognizes
> that and sends only the changed data.
Assuming the 5MB is in large enough chunks, yes. If you've got every 200th
byte changing, tarsnap won't be able to find any blocks it recognizes even
though the total number of bytes changing is small.
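To make that concrete, here is a sketch of the pathological case (file names
and sizes invented for illustration, scaled down to 1MB so the byte-poking
loop finishes quickly):
# Build a 1MB random file and a copy with one byte changed in every
# 200-byte window -- only ~5kB of actual changes.
dd if=/dev/urandom bs=1M count=1 of=small.a
cp small.a small.b
i=100
while [ $i -lt 1048576 ]; do
    printf 'X' | dd of=small.b bs=1 seek=$i conv=notrunc 2>/dev/null
    i=$((i + 200))
done

# The dry run should report nearly all of small.b as new data, since at
# this density of changes every chunk contains at least one changed byte.
tarsnap --keyfile tarsnap.key --dry-run --print-stats -c -f chunktest small.a small.b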
> My questions are:
>
> 1) Does that assumption hold even if the filename changes each day?
Yes, the deduplication is done across all the data stored using the same
set of keys.
> 2) Does that assumption hold if the filesystem inodes of the backup file
> change each day?
Yes, for the same reason. (Having the name, inode #, size, or mtime change
will result in Tarsnap reading the entire file from disk in order to look
for changes, of course, but it won't upload bits it recognizes as having
been previously uploaded.)
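A quick way to check this -- a sketch with made-up file and archive names,
and note that the first tarsnap command really does store ~10MB:
# Store a file, then copy it: the copy has a new name, inode, and mtime,
# but identical contents.
dd if=/dev/urandom bs=1M count=10 of=dump.monday
tarsnap --keyfile tarsnap.key -c -f test-monday dump.monday
cp dump.monday dump.tuesday

# Tarsnap will read dump.tuesday in full, but the "New data" figure in
# the statistics should be tiny -- just the new archive's metadata.
tarsnap --keyfile tarsnap.key --dry-run --print-stats -c -f test-tuesday dump.tuesday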
> 3) Does tarsnap recognize what data to send if only a small amount in
> the middle of the file changes?
It should, yes.
> I tested some scenarios by generating a 100MB file of random data. I
> tarsnapped several times, and obviously, the full file is only
> transmitted the first time. I then copied the file to a new name and
> tarsnapped again. The file is not transmitted because it's identical to
> the original file, so it appears #1 and #2 are true.
>
> Next I changed the new file so I had 25MB of identical data at the
> beginning of the file, 50MB of different data in the middle, and 25MB of
> identical data at the end. I tarsnapped again, and this time, I saw:
>
>                         Total size  Compressed size
> All archives             524618980        527202985
>   (unique data)          183722213        184630527
> deduptest5               104924533        105446860
>   (unique data)           78795582         79187867
>
> Tarsnap reports 75MB of unique data in this archive, instead of 50MB.
That's very weird. I just did my own test with two 100 MB files which
were the same aside from their middle 50MB and I got the expected result
(200 MB total archive size, 150 MB post-deduplication):
# dd if=/dev/urandom bs=1M count=25 of=part.1
# dd if=/dev/urandom bs=1M count=50 of=part.2a
# dd if=/dev/urandom bs=1M count=50 of=part.2b
# dd if=/dev/urandom bs=1M count=25 of=part.3
# cat part.1 part.2a part.3 > file.a
# cat part.1 part.2b part.3 > file.b
# tarsnap --keyfile tarsnap.key --dry-run -c -f foo file.a file.b
[...]
This archive              209847044        210883514
New data                  157539099        158316069
> Is that due to the design of the chunking algorithm and expected
> behavior? If it is, is my best option to split the dump file into parts
> that will likely remain static and the ones that will change more
> frequently?
Can you re-run the test to make sure that you really had 50 MB in common
between your two files?
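A quick sanity check (using the file names from my test above as stand-ins
for your two test files):
# cmp -l prints one line per differing byte, so the line count is the
# number of bytes that differ; with 50MB of independent random data in
# the middle it should come out a little under 50*2^20, since about 1
# in 256 of those random bytes will happen to match.
cmp -l file.a file.b | wc -l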
--
Colin Percival
Security Officer, FreeBSD | freebsd.org | The power to serve
Founder / author, Tarsnap | tarsnap.com | Online backups for the truly paranoid