
Recent partial Tarsnap outages



Hi all,

In the intervals between
  2009-01-13 12:59:22 and 2009-01-13 14:04:20
and between
  2009-01-15 00:15:10 and 2009-01-15 01:04:38
(all times in UTC), Tarsnap suffered partial outages.  No stored data was
affected, but attempts to list, extract, or delete archives, or attempts to
*start* creating archives during those intervals, may have failed with a
"Too many network failures" error.  Archive creations which were already in
progress when one of those two time periods began will have continued without
being affected by this.

I have identified the cause of these partial outages and have taken steps to
ensure that they do not recur.

[Technical details follow.]

Tarsnap uses the Amazon SimpleDB service for all of its internal accounting
data; an account's current balance, payments, and usage are all stored in
SimpleDB.  When the Tarsnap client connects to the Tarsnap service and attempts
to read data or start an archive transaction, the user's account balance is
obtained from SimpleDB, and if the account balance is not positive, the service
sends an "add more money" error response back to the Tarsnap client.

During the two above-mentioned outages, the SimpleDB service stopped responding
to requests from Tarsnap.  It is not entirely clear why this happened, but
Amazon's "Service Health Dashboard" mentions "Elevated Latencies" during one of
the two intervals, which may be related.

In the design of the Tarsnap service, I originally assumed that any SimpleDB
failures would be temporary in nature (as, indeed, they have been prior to these
two intervals) so I made the Tarsnap server code handle a failure in the Tarsnap
accounting system the same way as it handles other transient failures -- by
dropping the relevant connection and allowing the Tarsnap client to reconnect
and reissue its request.  Unfortunately, this approach does not work well when
the failure persists: every retry fails the same way, until the client gives up
with the "Too many network failures" error.  I have therefore revised the
Tarsnap server code to handle an accounting subsystem failure by assuming that
all users have positive
account balances.  Consequently, even if the underlying error condition in
SimpleDB recurs (whatever it might be), it will not have any impact on Tarsnap
users.
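
As a rough illustration of the change described above, the two policies can be
sketched in Python.  This is not the actual Tarsnap server code: every name
here (AccountingUnavailable, TransientFailure, the check_balance_* functions,
and the stand-in backends) is hypothetical, and the real SimpleDB lookup is
replaced by a plain callable.

```python
from decimal import Decimal


class AccountingUnavailable(Exception):
    """Hypothetical error: the accounting backend did not answer."""


class TransientFailure(Exception):
    """Hypothetical signal: drop the connection so the client reconnects."""


def check_balance_drop(fetch_balance, account_id):
    """Original policy: treat an accounting failure as a transient failure."""
    try:
        return fetch_balance(account_id) > 0
    except AccountingUnavailable as exc:
        # The caller is expected to drop the connection; the client retries.
        raise TransientFailure("accounting lookup failed") from exc


def check_balance_fail_open(fetch_balance, account_id):
    """Revised policy: if accounting is unavailable, assume a positive balance."""
    try:
        return fetch_balance(account_id) > 0
    except AccountingUnavailable:
        return True


def backend_up(account_id):
    """Stand-in for a working accounting lookup (made-up balances)."""
    balances = {"alice": Decimal("4.20"), "bob": Decimal("0.00")}
    return balances[account_id]


def backend_down(account_id):
    """Stand-in for an accounting backend that has stopped responding."""
    raise AccountingUnavailable(account_id)
```

With a working backend both functions give the same answer.  With the backend
down, check_balance_drop raises on every attempt, so a retrying client can
never succeed, whereas check_balance_fail_open lets the request proceed.  The
trade-off in this sketch is that an account with no remaining balance is also
let through while the backend is unavailable.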

-- 
Colin Percival
Security Officer, FreeBSD | freebsd.org | The power to serve
Founder / author, Tarsnap | tarsnap.com | Online backups for the truly paranoid