Tarsnap outage on 2011-12-06



Hi all,

As many people noticed, there was a Tarsnap outage for about 3 hours starting
slightly after 13:30 UTC on 2011-12-06 (a week ago today).  No data was ever
at risk; the outage simply made existing archives inaccessible and made it
impossible to create new archives / delete old archives / etc.

This outage was 100% my fault, and I've added a credit equal to 3x the daily
storage usage cost for 2011-12-06 to every Tarsnap account.  While Tarsnap
doesn't have any formal SLA, providing an account credit for one day of storage
costs for each hour of outage seems like the right thing to do given that this
was caused entirely by my stupidity.

Technical details follow for the morbidly curious:

Tarsnap's server-side code synthesizes a log-structured filesystem out of S3
object storage.  Like all log-structured systems, this requires "cleaning" in
order to eliminate space wasted by deleted blocks.  In the current incarnation
of the Tarsnap server code, this is performed by two cooperating components: a
"slow" daemon which examines on-disk state to figure out which blocks need to
be "cleaned", and a central process which does all the communication with S3,
both for cleaning and on behalf of Tarsnap clients.
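
To make the division of labour concrete, here's a purely illustrative sketch of
the sort of decision the slow daemon makes when it examines on-disk state; the
names, layout, and 50% threshold are all invented for this email rather than
taken from the real code.  A list of objects produced by a pass like this is
what the central process then acts on via S3.

/*
 * Illustrative only, not the actual Tarsnap server code.  The "slow"
 * daemon's job is, in essence, to decide which stored objects are now
 * mostly wasted space and should be rewritten.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct objinfo {
	uint64_t objnum;	/* S3 object (log segment) number. */
	size_t len_total;	/* Total bytes stored in the object. */
	size_t len_live;	/* Bytes belonging to blocks which still exist. */
};

/* An object which is mostly deleted blocks is worth cleaning. */
static int
needs_cleaning(const struct objinfo *o)
{

	return (o->len_live * 2 < o->len_total);
}

int
main(void)
{
	struct objinfo objs[] = {
		{ 1000, 1048576, 921600 },	/* Mostly live: leave it alone. */
		{ 1001, 1048576, 102400 }	/* Mostly deleted: clean it. */
	};
	size_t i;

	for (i = 0; i < sizeof(objs) / sizeof(objs[0]); i++) {
		if (needs_cleaning(&objs[i]))
			printf("object %ju should be cleaned\n",
			    (uintmax_t)objs[i].objnum);
	}
	return (0);
}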

For performance reasons, the central Tarsnap server process gives priority to
"front-end" requests (those from Tarsnap clients) over "back-end" requests.
At peak periods, this can result in the queue of back-end S3 writes not being
serviced; to keep process memory consumption under control, there is code which
tracks how much memory is being used by requests from each connection
(including requests sitting in the S3 write queue) and blocks the reading of
additional requests from any connection which is using too much.
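
In rough code-sketch form, that accounting looks something like the following;
the names, the quota value, and the exact condition for resuming reads are
invented for this email, not lifted from the server source.

/* Invented names and numbers; a sketch of the idea, not the real code. */
#include <stddef.h>
#include <stdio.h>

#define CONN_MEM_QUOTA	(16 * 1024 * 1024)	/* Hypothetical quota. */

struct conn {
	size_t mem_used;	/* Bytes held by this connection's pending
				 * requests, including queued S3 writes. */
	int reads_blocked;	/* Nonzero once we stop reading requests. */
};

/* Charge a request's buffers against the connection's quota. */
static void
conn_mem_charge(struct conn *c, size_t len)
{

	c->mem_used += len;
	if (c->mem_used > CONN_MEM_QUOTA)
		c->reads_blocked = 1;
}

/* Release memory when a request (e.g. an S3 write) completes. */
static void
conn_mem_release(struct conn *c, size_t len)
{

	c->mem_used -= len;
	if (c->mem_used <= CONN_MEM_QUOTA)
		c->reads_blocked = 0;
}

int
main(void)
{
	struct conn c = { 0, 0 };

	conn_mem_charge(&c, 20 * 1024 * 1024);
	printf("reads blocked: %d\n", c.reads_blocked);	/* 1 */
	conn_mem_release(&c, 20 * 1024 * 1024);
	printf("reads blocked: %d\n", c.reads_blocked);	/* 0 */
	return (0);
}

Note that the quota in this sketch, like the real one, is tracked per
connection; nothing limits the total across connections.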

However, the slow daemon uses the same timeout-and-retry connection-handling
code as the Tarsnap client.  As a result, if the central server process is under
heavy load for an extended period of time, multiple connections will be created
by the slow daemon, and each of them will receive an independent memory quota,
resulting in a steadily increasing amount of memory being consumed.

That's how things stood on December 5th, at least.

On December 6th, I rolled out an update to the server code intended to prevent
this growing backlog of back-end writes.  The "fix" was simple: If the
queue was too long, back-end writes would get dropped on the floor instead of
being queued.  (The slow cleaner daemon performs extra passes which detect and
handle this case, along with other reasons a block might not be cleaned, e.g.,
if S3 GETs fail.)  To that extent, the fix worked: The back-end write queue
stayed at a reasonable length.

Unfortunately, I had forgotten that memory allocations were being freed in
response to writes completing.  This didn't cause any problems until there was
a large backlog of back-end writes for an extended period of time; but
ultimately the server code got into a state where it didn't want to add
anything to the queue (because back-end operations were using a lot of RAM) but
also didn't want to write anything from the queue (because the queue was almost
empty), with the result that memory usage increased even faster than it had
before the "fix".  Within a few hours the server process ran out of memory and
died.
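
In code-sketch form, the bad interaction looked roughly like this.  As before,
the names, the queue limit, and the overall structure are invented for this
email (the real code is considerably more complicated), but the shape of the
mistake is the same: in this sketch, memory is charged when a write is
submitted and only released when a write completes, so a dropped write's
memory is never released.

/* Invented names; a simplification of the real logic. */
#include <stdio.h>
#include <stdlib.h>

#define QUEUE_MAX	4	/* Hypothetical queue length limit. */

struct write_req {
	void *buf;		/* Block data waiting to be written to S3. */
	size_t len;
	struct write_req *next;
};

static struct write_req *queue_head = NULL;
static size_t queue_len = 0;
static size_t backend_mem = 0;	/* Memory charged to back-end requests. */

/* Submit a back-end (log cleaning) write. */
static void
backend_write_submit(struct write_req *r)
{

	backend_mem += r->len;	/* Charged as soon as the request exists. */
	if (queue_len >= QUEUE_MAX) {
		/*
		 * BUG: the write is silently dropped: it is never queued,
		 * but nothing here frees r->buf or reduces backend_mem.
		 * (In the real code this was eventually cleaned up, but
		 * far too late to help.)
		 */
		return;
	}
	r->next = queue_head;
	queue_head = r;
	queue_len++;
}

/* Completion handler: the only place memory gets released. */
static void
backend_write_done(struct write_req *r)
{

	backend_mem -= r->len;
	free(r->buf);
	free(r);
}

int
main(void)
{
	int i;

	/* Submit many writes while the queue is not being serviced. */
	for (i = 0; i < 1000; i++) {
		struct write_req *r = malloc(sizeof(*r));

		r->buf = malloc(65536);
		r->len = 65536;
		r->next = NULL;
		backend_write_submit(r);
	}
	printf("queue length %zu, back-end memory %zu bytes\n",
	    queue_len, backend_mem);

	/* Even after the few queued writes complete, almost all of the
	 * memory charged to the back end is still outstanding. */
	while (queue_head != NULL) {
		struct write_req *r = queue_head;

		queue_head = r->next;
		queue_len--;
		backend_write_done(r);
	}
	printf("after draining: queue length %zu, back-end memory %zu bytes\n",
	    queue_len, backend_mem);
	return (0);
}

Run it and you'll see the queue length stay small while the charged memory
keeps growing; that's essentially what the server process did on December 6th
until it fell over.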

It took me an embarrassingly long time to figure out what had happened, but once
I determined that the problem was related to log cleaning, I got things running
again with cleaning disabled; and a few days ago I rolled out a new fix which
ensures that when the back-end write queue gets too long, additional writes are
*freed immediately* instead of being merely not queued but freed later.  Having
seen this code path get triggered a few times now without problems, I'm
confident that this bug is well and firmly squashed.
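
In terms of the sketch above, the corrected backend_write_submit() looks like
this (again, this is the idea rather than the literal change to the server
code):

static void
backend_write_submit(struct write_req *r)
{

	backend_mem += r->len;
	if (queue_len >= QUEUE_MAX) {
		/*
		 * Fixed: release the dropped write immediately instead of
		 * leaving it for a completion handler which will never see
		 * it.  The slow cleaner daemon will notice on a later pass
		 * that the block wasn't rewritten and will try again.
		 */
		backend_mem -= r->len;
		free(r->buf);
		free(r);
		return;
	}
	r->next = queue_head;
	queue_head = r;
	queue_len++;
}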

Thanks for your patience and continued support,
-- 
Colin Percival
Security Officer, FreeBSD | freebsd.org | The power to serve
Founder / author, Tarsnap | tarsnap.com | Online backups for the truly paranoid