
Outage 2013-07-03 ~11:10 - 11:40 UTC



Hi all,

I meant to send out an email about this earlier but it slipped my mind
thanks to a nasty cold I've been battling.  (Being woken up by my phone
at 4:10 AM local time couldn't have helped much either, come to think
of it...)

Yesterday there was a ~30 minute partial outage caused by a glitch in
the Amazon SimpleDB service used by Tarsnap for tracking user account
balances.  Unfortunately, rather than failing outright -- in which case
I've set the Tarsnap server to "fail open" and assume that account
balances are positive -- the service was timing out, resulting in the
Tarsnap server code taking too long to respond and the Tarsnap client
giving up.

This would have caused any archive extracts, and any archive creations
started during the outage period, to fail.  Archive creations started
before ~11:10 UTC would have been unaffected by this even if they
continued into the outage period.

Any operations affected by this outage would have failed with tarsnap
printing "tarsnap: Too many network failures".  If you did not see this
error message, you were not affected.  (Unless you don't read tarsnap's
output, of course.)

While I have no particular reason to expect this problem to recur, the
SimpleDB service seems to have been relegated to a "legacy" status (it
has always had a rather wonky design, and Amazon now has Amazon RDS and
Amazon DynamoDB, which are better for almost all contexts) so it's not
a service I'm entirely comfortable trusting to be reliable any more.
In the short term I intend to prevent this sort of outage by making the
Tarsnap server code more aggressive about giving up on SimpleDB, and in
the longer term I intend to move away from SimpleDB entirely.
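
As a rough illustration of the short-term fix (this is a sketch in
Python, not the actual Tarsnap server code, which is not shown here),
the idea is to treat a slow balance lookup the same way as a failed
one: give up after a short deadline and fail open.  The function name,
the lookup_balance callable, and the 0.5 second deadline are all
hypothetical; the only real requirement is that the deadline be well
below the point at which the client gives up.

```python
import concurrent.futures

# Hypothetical deadline for a balance lookup, in seconds.  It should be
# much shorter than the time the client is willing to wait.
BALANCE_LOOKUP_TIMEOUT = 0.5

# Lookups run on worker threads so the caller can stop waiting on them.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def account_in_good_standing(lookup_balance, account_id,
                             timeout=BALANCE_LOOKUP_TIMEOUT):
    """Return False only if the balance is known to be non-positive.

    lookup_balance is a hypothetical callable that queries the balance
    store and returns a number.  If it raises or does not answer within
    `timeout` seconds, we fail open and assume the balance is positive.
    """
    future = _pool.submit(lookup_balance, account_id)
    try:
        return future.result(timeout=timeout) > 0
    except concurrent.futures.TimeoutError:
        # The store is too slow: stop waiting and fail open.  cancel()
        # only helps if the lookup has not started running yet; a
        # lookup already in progress finishes in the background.
        future.cancel()
        return True
    except Exception:
        # The store reported an error: fail open, as before.
        return True
```

With this shape, a lookup that hangs costs the caller at most `timeout`
seconds instead of stalling the whole request, at the price of
occasionally letting through an operation for an account whose balance
could not be checked.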

-- 
Colin Percival
Security Officer Emeritus, FreeBSD | The power to serve
Founder, Tarsnap | www.tarsnap.com | Online backups for the truly paranoid