
Re: Backing up large unchanging files



On Thu, Jul 17, 2025 at 07:20:45PM +0000, Colin Percival wrote:
> On 7/16/25 02:10, Tim Bishop wrote:
> > On Mon, Jul 14, 2025 at 07:07:10PM +0100, Tim Bishop wrote:
> > > On Mon, Jul 14, 2025 at 03:53:50PM +0000, Colin Percival wrote:
> > > > On 7/14/25 08:46, Tim Bishop wrote:
> > > > > I'm backing up some large unchanging files (web server logs). Aside from
> > > > > the current log, they mostly are unchanging on a daily basis. As per
> > > > > recommendations I've not compressed these files which gives Tarsnap the
> > > > > best chance to deduplicate and compress.
> > > > > 
> > > > > But, the problem is that Tarsnap is reading these files every day in
> > > > > their entirety. I guess it has to so it can identify changed blocks, but
> > > > > this is making the backup take a long time and creates a fair amount of
> > > > > I/O. And aside from the monthly log rollover, these files haven't
> > > > > changed from one day to the next.
> > > > 
> > > > Assuming you're not running with --lowmem, tarsnap should recognize files
> > > > which haven't had their {path, inode number, size, mtime} change since the
> > > > last backup.  So it should only be re-reading the file which is currently
> > > > being written, not the rotated logs.
> > > 
> > > Thanks - that's what I hoped would happen. But not what I saw (no
> > > --lowmem option in use). Here's an example rotated log file:
> > 
> > It happened again the following night. I couldn't replicate with a dry
> > run, or with a dry run using --lowmem. However, a dry run with
> > --verylowmem did exhibit the same behaviour.
> > 
> > Does Tarsnap automatically enable lowmem/verylowmem in any circumstance?
> > For example, if system memory is low.
> 
> Aha!  I had completely forgotten about this -- it's something I implemented
> in 2008 -- but yes, there's a scenario where tarsnap switches into "lowmem"
> mode.  Specifically, if you're archiving a large number of small files, the
> default ("normalmem") regime can end up caching a lot of archive data; to
> avoid this, tarsnap keeps track of the memory used to track "trailers" (aka
> data at the end of a file which is too small to be its own block) and stops
> storing those if they're taking up more memory than tarsnap is using to keep
> track of complete blocks of data.
> 
> The good news here is that
> 1. In this scenario, *most* large files still get completely cached; it's just
> an unlucky few (around 5-10% of them) which happen to be a number of complete
> chunks plus a small extra bit which get affected, and
> 2. Tarsnap is still caching the list of complete chunks, so while it has to
> re-read the file every time, it's doing the far less CPU-intensive process of
> computing hashes to verify that the data hasn't changed, rather than running
> all of the data through the (considerably more CPU-intensive) chunking code.
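
To make the heuristic quoted above concrete, here is a minimal sketch in
Python. It is not tarsnap's code: the class name, method names and byte
accounting are all hypothetical, and it assumes that trailers already stored
are kept once the switch happens, which the description above does not
confirm either way.

```python
class TrailerCacheSketch:
    """Toy model of the described heuristic. NOT tarsnap's implementation;
    all names and the accounting scheme are assumptions for illustration."""

    def __init__(self):
        self.block_bytes = 0        # memory spent tracking complete blocks
        self.trailer_bytes = 0      # memory spent tracking trailers
        self.store_trailers = True  # False once we behave like "lowmem"
        self.trailers = {}          # path -> trailer bytes

    def add_block_record(self, record_size):
        """Account for one cached record describing a complete block."""
        self.block_bytes += record_size

    def add_trailer(self, path, trailer):
        """Try to cache a file's trailer; return True if it was stored."""
        if not self.store_trailers:
            return False
        self.trailers[path] = trailer
        self.trailer_bytes += len(trailer)
        # Stop storing further trailers once they use more memory than
        # the records for complete blocks do.
        if self.trailer_bytes > self.block_bytes:
            self.store_trailers = False
        return True
```

In this model, a file whose trailer was not stored cannot be skipped on the
next run, which matches the behaviour described above: it gets re-read, but
its cached list of complete chunks can still be checked by hashing.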

Hmm, in this particular archive there are fewer than 30k files to back up. Is
that a "large number"? They're mostly small files, plus a very small
percentage of huge ones.

> I *think* I can address this with a patch to say "cache trailers on large
> files even if we've decided on our own to switch into lowmem mode".  You've
> been using Tarsnap for a long time; can I assume that you can test a patch
> for me once I've written something?

Absolutely!

Tim.
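
For reference, the unchanged-file check mentioned earlier in the thread
(comparing {path, inode number, size, mtime} against the previous backup) can
be sketched as below. This is only an illustration of the idea using
Python's os.stat; the helper names and the in-memory dict standing in for the
cache are hypothetical and say nothing about tarsnap's actual cache format.

```python
import os
import tempfile
from typing import Dict, NamedTuple


class FileKey(NamedTuple):
    path: str
    inode: int
    size: int
    mtime_ns: int


def file_key(path: str) -> FileKey:
    """Collect the four attributes named in the thread for one file."""
    st = os.stat(path)
    return FileKey(path, st.st_ino, st.st_size, st.st_mtime_ns)


def is_unchanged(path: str, cache: Dict[str, FileKey]) -> bool:
    """True if the file's key matches what was recorded last time."""
    return cache.get(path) == file_key(path)


def demo():
    """Record a key, check it, then append to the file and check again."""
    with tempfile.TemporaryDirectory() as d:
        p = os.path.join(d, "access.log.1")
        with open(p, "wb") as f:
            f.write(b"line\n")
        cache = {p: file_key(p)}          # state after the "last backup"
        before = is_unchanged(p, cache)   # nothing touched yet
        with open(p, "ab") as f:
            f.write(b"more\n")            # size changes from 5 to 10 bytes
        after = is_unchanged(p, cache)
    return before, after
```

A file that passes this check would not need to be opened at all, which is
why rotated logs would normally be skipped.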