
Re: Feature request: --sort option



Hi Brian,

I'll present three things: a workaround, an acknowledgement, and
potentially a different solution for your problem.  :)

Workaround for --sort: I stumbled across this myself two weeks
ago, although I was aiming for merely "deterministic", rather than
any particular ordering.

    export LC_ALL=C
    find BACKUP_DIR | sort > filelist.txt

    tarsnap --my --usual --args \
        -n -T filelist.txt

    rm filelist.txt

-T reads the names to archive from a file list, but you also need
the -n in there; otherwise libarchive will see BACKUP_DIR/ and
start recursively adding all files and dirs from it, ignoring the
order produced by `sort`.
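
A minimal sketch of the same workaround as a reusable function, in
case it helps.  The name make_file_list is made up for illustration,
and the tarsnap line in the comment just reuses the placeholder
arguments from above.  It assumes no file name contains a newline.

```shell
#!/bin/sh
# make_file_list DIR OUT (hypothetical helper name):
# write every path under DIR to OUT, one per line, in byte order.
make_file_list() {
    # LC_ALL=C makes sort compare raw bytes, so the result does not
    # depend on the locale of the machine running it.
    find "$1" | LC_ALL=C sort > "$2"
}

# Possible usage (placeholders, not real arguments):
#   list=$(mktemp) && trap 'rm -f "$list"' EXIT
#   make_file_list BACKUP_DIR "$list"
#   tarsnap --my --usual --args -n -T "$list"
```

Keeping the list in a mktemp file outside BACKUP_DIR means the list
itself doesn't end up in the archive, and the trap removes it even
if tarsnap fails.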

If you only read & write from the same system, you don't need
LC_ALL, but I needed it when switching from Ubuntu to FreeBSD.
My profile was slightly different on the two platforms -- it was
something like "en" vs. "en_CA", and that changed the sort order
of capital letters.  (Don't quote me on the exact locales and
their precise effect on the sort order, though!  All I cared about
was that they were annoyingly different.)
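
To make the capital-letter point concrete, here is a tiny sketch of
what the C locale does with mixed-case names.  The function name is
made up; the only claim is about LC_ALL=C, which compares bytes, so
every ASCII capital sorts before every lowercase letter.

```shell
#!/bin/sh
# Sort four sample names by raw byte value, as the C locale does.
sorted_c() {
    printf '%s\n' b A a B | LC_ALL=C sort
}

sorted_c
# prints A, B, a, b (one per line)
```

Other locales may interleave the cases instead, which is the kind
of difference that bit me.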


As for acknowledgement: yes, it would be nice to have --sort, or
at least some form of deterministic file order.  Libarchive has
considered it, but it's low priority:
https://github.com/libarchive/libarchive/issues/602
As you noted, such a switch would require more memory, which is
why libarchive has been hesitant to adopt it.  I expect that as
more people get interested in reproducible builds, there will be
more interest in this.


Workaround for progress: the next version of the client will
include
  tarsnap --progress-bytes SIZE
which prints a progress message every SIZE bytes (of unencrypted
file size, not post-compression size).  So if you know the total
size of the files in your backup director(ies), then you can get
an idea of how far it's progressed.
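
The arithmetic is trivial, but for completeness here is a sketch.
percent_done is a made-up helper, and how you get the bytes-so-far
number out of the progress messages is left to you, since it
depends on the message format.

```shell
#!/bin/sh
# percent_done DONE TOTAL (hypothetical helper):
# print the integer percentage of TOTAL bytes covered by DONE bytes.
percent_done() {
    if [ "$2" -le 0 ]; then
        echo "total must be positive" >&2
        return 1
    fi
    # Integer division, so the result is rounded down.
    echo $(( $1 * 100 / $2 ))
}
```

For example, `percent_done 250 1000` prints 25.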

Now, due to the way that libarchive counts directory sizes, hard
links, and soft links, simply doing
    du -s DIR
won't give you the same answer.  As long as you're not using
hard links, I think this works:

#!/bin/sh
ls -laR "$@" | grep -v "^d" | grep -v "^l" | awk '{sum += $5} END{print sum}'

although IIRC if you're on Linux then you need to change the $5 to
$4.  (I'm still preparing a draft of this material.)
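
A variant of the same idea, as a sketch only: let find pick out the
regular files instead of grepping the ls output.  sum_file_bytes is
a made-up name, and I haven't checked that this matches tarsnap's
count any better than the one-liner above.  It still counts a
hard-linked file once per link, and it assumes no file name
contains a newline.

```shell
#!/bin/sh
# sum_file_bytes DIR... (hypothetical helper):
# print the total size in bytes of all regular files under the given dirs.
sum_file_bytes() {
    # -type f skips directories and symlinks; ls -n prints numeric
    # owner and group, so the size is always the fifth field.
    find "$@" -type f -exec ls -ln {} + | awk '{sum += $5} END {print sum + 0}'
}
```

The `sum + 0` makes awk print 0 rather than an empty line when
there are no files at all.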

Cheers,
- Graham

On Fri, Jul 03, 2020 at 03:37:40PM -0700, Brian L. Matthews wrote:
> On macOS at least, tarsnap processes a directory in an effectively random
> order (I suppose it's probably creation order). Unfortunately that makes
> watching -v or using SIGINFO not very useful, as knowing what file is being
> processed doesn't tell me much when I don't know if it's the first or last
> or somewhere in between. GNU tar has a --sort option that can be used to
> lexicographically sort each directory's contents. While that would take more
> memory, on modern systems it doesn't seem like it would be a problem, and it
> could be forced off under --lowmem (or cause an immediate error if they're
> both specified). Looking at the source, it seems like it would just take a
> couple of changes to tree_next in tar/tree.c (although admittedly I spent
> about 10 seconds looking at it :-) ).
> 
> Thanks,
> Brian