
Re: Backing up large unchanging files



On 7/16/25 02:10, Tim Bishop wrote:
> On Mon, Jul 14, 2025 at 07:07:10PM +0100, Tim Bishop wrote:
>> On Mon, Jul 14, 2025 at 03:53:50PM +0000, Colin Percival wrote:
>>> On 7/14/25 08:46, Tim Bishop wrote:
>>>> I'm backing up some large, unchanging files (web server logs). Aside
>>>> from the current log, they are mostly unchanged on a daily basis. As
>>>> per recommendations, I've not compressed these files, which gives
>>>> Tarsnap the best chance to deduplicate and compress.
>>>>
>>>> But the problem is that Tarsnap is reading these files in their
>>>> entirety every day. I guess it has to, so it can identify changed
>>>> blocks, but this makes the backup take a long time and creates a fair
>>>> amount of I/O. And aside from the monthly log rollover, these files
>>>> haven't changed from one day to the next.
>>>
>>> Assuming you're not running with --lowmem, tarsnap should recognize
>>> files which haven't had their {path, inode number, size, mtime} change
>>> since the last backup.  So it should only be re-reading the file which
>>> is currently being written, not the rotated logs.
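
For the curious, that "has this file changed?" test is conceptually just a
stat(2) comparison against the cache entry; a minimal sketch follows (the
struct and function names are hypothetical, not tarsnap's actual code):

#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

struct cache_entry {
    char path[4096];    /* archive path at last backup */
    ino_t ino;          /* inode number at last backup */
    off_t size;         /* file size at last backup */
    time_t mtime;       /* modification time at last backup */
};

/* Return nonzero if the file still matches its cached entry, i.e. it
 * can be skipped without re-reading any data. */
static int
cache_entry_matches(const struct cache_entry *ce, const char *path)
{
    struct stat sb;

    if (stat(path, &sb) != 0)
        return (0);
    return (strcmp(ce->path, path) == 0 &&
        sb.st_ino == ce->ino &&
        sb.st_size == ce->size &&
        sb.st_mtime == ce->mtime);
}

int
main(void)
{
    struct cache_entry ce = { "/var/log/httpd/access.log.1", 0, 0, 0 };

    puts(cache_entry_matches(&ce, ce.path) ?
        "unchanged: reuse cached chunk list" :
        "changed: re-read and re-chunk");
    return (0);
}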

>> Thanks - that's what I hoped would happen, but it's not what I saw (no
>> --lowmem option in use). Here's an example rotated log file:
>
> It happened again the following night. I couldn't replicate it with a
> dry run, or with a dry run using --lowmem. However, a dry run with
> --verylowmem did exhibit the same behaviour.
>
> Does Tarsnap automatically enable lowmem/verylowmem under any
> circumstances? For example, if system memory is low.

Aha!  I had completely forgotten about this -- it's something I implemented
in 2008 -- but yes, there's a scenario where tarsnap switches into "lowmem"
mode.  Specifically, if you're archiving a large number of small files, the
default ("normalmem") regime can end up caching a lot of archive data; to
avoid this, tarsnap tracks the memory used to store "trailers" (i.e. the
data at the end of a file which is too small to be its own chunk) and stops
storing them if they're taking up more memory than tarsnap is using to keep
track of complete chunks of data.
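
In rough outline, the accounting works like this (a minimal sketch of the
heuristic described above; all names and numbers are hypothetical, not
tarsnap's actual code):

#include <stddef.h>
#include <stdio.h>

static size_t chunk_cache_bytes;    /* memory tracking complete chunks */
static size_t trailer_cache_bytes;  /* memory caching trailer data */

/* Account for the bookkeeping cost of one complete chunk. */
static void
note_chunk(size_t overhead)
{
    chunk_cache_bytes += overhead;
}

/* Decide whether to cache a trailer or drop it (lowmem behaviour):
 * once trailers cost more memory than the bookkeeping for complete
 * chunks, stop storing them. */
static int
should_cache_trailer(size_t trailerlen)
{
    if (trailer_cache_bytes + trailerlen > chunk_cache_bytes)
        return (0);
    trailer_cache_bytes += trailerlen;
    return (1);
}

int
main(void)
{
    /* A big file first: plenty of complete chunks to keep track of. */
    note_chunk(100000);
    printf("trailer cached: %d\n", should_cache_trailer(4096));  /* 1 */

    /* After many small files, the balance tips and caching stops. */
    trailer_cache_bytes = 200000;
    printf("trailer cached: %d\n", should_cache_trailer(4096));  /* 0 */
    return (0);
}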

The good news here is that:
1. In this scenario, *most* large files still get completely cached; it's
just an unlucky few (around 5-10% of them) -- those which happen to be a
number of complete chunks plus a small extra bit -- which are affected, and
2. Tarsnap still caches the list of complete chunks, so while it has to
re-read the file every time, it's only doing the far less CPU-intensive
work of computing hashes to verify that the data hasn't changed, rather
than running all of the data through the (considerably more CPU-intensive)
chunking code.
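
To make point 2 concrete: with the chunk list cached, the re-read is just
sequential reads plus one hash per chunk, with no boundary search.  A
sketch (the hash here is a cheap stand-in for the real cryptographic one,
and all names are hypothetical):

#include <stdint.h>
#include <stdio.h>

/* Stand-in for the real (cryptographic) per-chunk hash. */
static uint64_t
hash_chunk(const uint8_t *buf, size_t len)
{
    uint64_t h = 14695981039346656037ULL;  /* FNV-1a */
    size_t i;

    for (i = 0; i < len; i++) {
        h ^= buf[i];
        h *= 1099511628211ULL;
    }
    return (h);
}

struct chunk_rec {
    size_t len;     /* chunk length recorded at the last backup */
    uint64_t hash;  /* chunk hash recorded at the last backup */
};

/* Verify a file against its cached chunk list: hashing only, no
 * chunk-boundary search.  Returns 1 if every chunk still matches. */
static int
verify_chunks(FILE *f, const struct chunk_rec *recs, size_t nrecs)
{
    uint8_t buf[65536];
    size_t i;

    for (i = 0; i < nrecs; i++) {
        if (recs[i].len > sizeof(buf) ||
            fread(buf, 1, recs[i].len, f) != recs[i].len)
            return (0);
        if (hash_chunk(buf, recs[i].len) != recs[i].hash)
            return (0);
    }
    return (1);
}

int
main(void)
{
    const char data[] = "example chunk data";
    struct chunk_rec rec = { sizeof(data),
        hash_chunk((const uint8_t *)data, sizeof(data)) };
    FILE *f = tmpfile();

    if (f == NULL)
        return (1);
    fwrite(data, 1, sizeof(data), f);
    rewind(f);
    puts(verify_chunks(f, &rec, 1) ? "unchanged" : "changed");
    fclose(f);
    return (0);
}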

I *think* I can address this with a patch to say "cache trailers on large
files even if we've decided on our own to switch into lowmem mode".  You've
been using Tarsnap for a long time; can I assume that you can test a patch
for me once I've written something?
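
Concretely, I'd expect the patch to amount to a wrapper around the
heuristic sketched earlier, something like the following (hypothetical
names; the size threshold is just a placeholder):

#include <stddef.h>
#include <sys/types.h>

/* Hypothetical size above which a file counts as "large". */
#define LARGE_FILE_THRESHOLD    ((off_t)16 * 1024 * 1024)

/* The accounting heuristic from the earlier sketch. */
extern int should_cache_trailer(size_t trailerlen);

/* Proposed behaviour: always cache the trailer of a large file, even
 * after the automatic switch into lowmem mode, so that the whole file
 * can be skipped on the next run.  (Accounting omitted for brevity.) */
int
should_cache_trailer_patched(size_t trailerlen, off_t filelen)
{
    if (filelen >= LARGE_FILE_THRESHOLD)
        return (1);
    return (should_cache_trailer(trailerlen));
}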

--
Colin Percival
FreeBSD Release Engineering Lead & EC2 platform maintainer
Founder, Tarsnap | www.tarsnap.com | Online backups for the truly paranoid