Re: Database backup and deduplication question
On 12/19/11 14:59, Greg Larkin wrote:
> I'm using tarsnap to back up some large MySQL database dump files, and
> at the moment, they are compressed prior to backup. I know that means
> I'll have to push the maximum amount of data each day, so I'm looking to
> reduce the time and bandwidth it takes to store each one.
Right, compression (at least, if it isn't done with something like the
'rsyncable' option) breaks deduplication.
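If the dumps have to stay compressed, something along these lines keeps them
friendlier to deduplication (a sketch; 'mydb' is a placeholder database name,
and --rsyncable is only present in gzip builds which carry the rsyncable
patch, e.g. Debian's):
# Dedup-hostile: a small change early in the dump typically changes all
# of the compressed output after it.
mysqldump mydb | gzip > dump.sql.gz

# Dedup-friendlier: gzip periodically resets its compression state, so
# unchanged stretches of the dump compress to identical output.
mysqldump mydb | gzip --rsyncable > dump.sql.gz

# Friendliest: store the dump uncompressed and let tarsnap compress the
# deduplicated blocks itself.
mysqldump mydb > dump.sql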
> I assume that if I generate a uncompressed full database backup file of
> ~1GB each day and only ~5MB of the contents change, tarsnap recognizes
> that and sends only the changed data.
Assuming the 5MB is in large enough chunks, yes. If you've got every 200th
byte changing, tarsnap won't be able to find any blocks it recognizes even
though the total number of bytes changing is small.
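To make that concrete, here is a sketch of the pathological case (file names
and sizes invented for illustration, scaled down to 1MB so the byte-poking
loop finishes quickly):
# Build a 1MB random file and a copy with one byte changed in every
# 200-byte window -- only ~5kB of actual changes.
dd if=/dev/urandom bs=1M count=1 of=small.a
cp small.a small.b
i=100
while [ $i -lt 1048576 ]; do
    printf 'X' | dd of=small.b bs=1 seek=$i conv=notrunc 2>/dev/null
    i=$((i + 200))
done

# The dry run should report nearly all of small.b as new data, since at
# this density of changes every chunk contains at least one changed byte.
tarsnap --keyfile tarsnap.key --dry-run --print-stats -c -f chunktest small.a small.b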
> My questions are:
>
> 1) Does that assumption hold even if the filename changes each day?
Yes, the deduplication is done across all the data stored using the same
set of keys.
> 2) Does that assumption hold if the filesystem inodes of the backup file
> change each day?
Yes, for the same reason. (Having the name, inode #, size, or mtime change
will result in Tarsnap reading the entire file from disk in order to look
for changes, of course, but it won't upload bits it recognizes as having
been previously uploaded.)
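A quick way to check this -- a sketch with made-up file and archive names,
and note that the first tarsnap command really does store ~10MB:
# Store a file, then copy it: the copy has a new name, inode, and mtime,
# but identical contents.
dd if=/dev/urandom bs=1M count=10 of=dump.monday
tarsnap --keyfile tarsnap.key -c -f test-monday dump.monday
cp dump.monday dump.tuesday

# Tarsnap will read dump.tuesday in full, but the "New data" figure in
# the statistics should be tiny -- just the new archive's metadata.
tarsnap --keyfile tarsnap.key --dry-run --print-stats -c -f test-tuesday dump.tuesday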
> 3) Does tarsnap recognize what data to send if only a small amount in
> the middle of the file changes?
It should, yes.
> I tested some scenarios by generating a 100MB file of random data. I
> tarsnapped several times, and obviously, the full file is only
> transmitted the first time. I then copied the file to a new name and
> tarsnapped again. The file is not transmitted because it's identical to
> the original file, so it appears #1 and #2 are true.
>
> Next I changed the new file so I had 25MB of identical data at the
> beginning of the file, 50MB of different data in the middle, and 25MB of
> identical data at the end. I tarsnapped again, and this time, I saw:
>
>                         Total size  Compressed size
> All archives             524618980        527202985
>   (unique data)          183722213        184630527
> deduptest5               104924533        105446860
>   (unique data)           78795582         79187867
>
> Tarsnap reports 75MB of unique data in this archive, instead of 50MB.
That's very weird. I just did my own test with two 100 MB files which
were the same aside from their middle 50MB and I got the expected result
(200 MB total archive size, 150 MB post-deduplication):
# dd if=/dev/urandom bs=1M count=25 of=part.1
# dd if=/dev/urandom bs=1M count=50 of=part.2a
# dd if=/dev/urandom bs=1M count=50 of=part.2b
# dd if=/dev/urandom bs=1M count=25 of=part.3
# cat part.1 part.2a part.3 > file.a
# cat part.1 part.2b part.3 > file.b
# tarsnap --keyfile tarsnap.key --dry-run -c -f foo file.a file.b
[...]
This archive              209847044        210883514
New data                  157539099        158316069
> Is that due to the design of the chunking algorithm and expected
> behavior? If it is, is my best option to split the dump file into parts
> that will likely remain static and the ones that will change more
> frequently?
Can you re-run the test to make sure that you really had 50 MB in common
between your two files?
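A quick sanity check (using the file names from my test above as stand-ins
for your two test files):
# cmp -l prints one line per differing byte, so the line count is the
# number of bytes that differ; with 50MB of independent random data in
# the middle it should come out a little under 50*2^20, since about 1
# in 256 of those random bytes will happen to match.
cmp -l file.a file.b | wc -l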
--
Colin Percival
Security Officer, FreeBSD | freebsd.org | The power to serve
Founder / author, Tarsnap | tarsnap.com | Online backups for the truly paranoid