
Tarsnap partial outage, 2021-12-07 18:00-19:50 UTC

Hi everyone,

I'm sure most of you are aware of the AWS outage yesterday, which started
some time around 17:30 UTC and was mostly resolved around 20:45 UTC.  (For
more accurate times and details of what went wrong, wait for the post-mortem
which AWS will presumably publish in the coming days.)  I'm sending
this email to provide some additional details about how it affected Tarsnap.

The outage had no immediate effects on Tarsnap; however, around 18:00 UTC (about
half an hour after the outage started) a large proportion of Amazon SimpleDB
requests started timing out.  Tarsnap uses SimpleDB for its accounting data;
this is "legacy" code in that it's long overdue for replacement (it will be
replaced with a system built on the kivaloo data store).  When connections
arrive at the Tarsnap service, queries are sent to SimpleDB to check the
account balance of the relevant user, so that an error can be sent back for
accounts which have run out of funds (or which never had funds added in the
first place).

Due to previous SimpleDB-related glitches, the Tarsnap service treats errors
reading the account balance permissively -- that is, requests are allowed
unless it *knows* that the account balance is not positive.  Unfortunately,
the code in question still waits for a response from SimpleDB.  As a result,
when SimpleDB requests started timing out (without returning errors), the
handling of incoming connections to the Tarsnap service became very slow,
with many requests timing out.
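
To illustrate the behaviour described above, here is a minimal sketch of a
permissive ("fail-open") balance check.  It is not the Tarsnap server code;
the function name, the lookup_balance callable, and the deadline are all
hypothetical.  The deadline is one possible way to stop a silent backend
from stalling the connection, shown here only as an assumption.

```python
import concurrent.futures


def allow_request(lookup_balance, account, timeout_s, executor):
    """Return False only when the balance is *known* to be non-positive.

    lookup_balance is a hypothetical callable standing in for the
    accounting query.  It may return a number, raise, or hang.
    """
    future = executor.submit(lookup_balance, account)
    try:
        balance = future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # No answer before the deadline: treat it like an error, fail open.
        return True
    except Exception:
        # The lookup failed outright: balance unknown, so fail open.
        return True
    # We have a definite answer, so we can enforce it.
    return balance > 0
```

Without the deadline, the first branch never fires and the caller waits for
as long as the backend stays silent, which matches the slowdown described
above.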

After some careful re-reading of the relevant code (I wrote it 13 years ago
and wanted to make sure I didn't break anything further) I determined that
the safest workaround was to point the SimpleDB traffic at Amazon S3 -- which
very reliably and quickly sent errors back.  (My apologies to anyone at AWS
who was confused to see SimpleDB requests landing at S3 nodes!)  Once errors
were being sent back quickly, the Tarsnap service reverted to its fail-safe
behaviour of allowing requests through.  I reverted this configuration change
around 2:00 UTC, after confirming that SimpleDB was operating normally.
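
The reason this workaround helps can be shown with a small simulation.  This
is only a sketch under stated assumptions: both "backends" are hypothetical
stand-ins (no real SimpleDB or S3 calls are made), and the delay is shortened
to a fraction of a second.

```python
import time


def permissive_check(lookup_balance, account):
    """Blocking fail-open check: wait for the lookup, allow on any error."""
    try:
        return lookup_balance(account) > 0
    except Exception:
        return True


def hanging_backend(account, delay_s=0.2):
    # Hypothetical stand-in for a request that sits until it times out.
    time.sleep(delay_s)
    raise TimeoutError("no response")


def fast_error_backend(account):
    # Hypothetical stand-in for an endpoint that rejects the request at once.
    raise ConnectionError("request rejected")


def timed(backend):
    """Run the check against a backend; return (allowed, seconds elapsed)."""
    start = time.monotonic()
    allowed = permissive_check(backend, "example-account")
    return allowed, time.monotonic() - start
```

Both backends lead to the request being allowed; the only difference is how
long the caller waits before the permissive path is taken.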

Attempts to add funds to Tarsnap accounts and/or to read recent account
activity continued to experience errors until SimpleDB recovered at around
22:00 UTC, since the workaround I applied to the Tarsnap service did not
help with the website.

For clarity: No data was ever lost; any archives successfully created (as
reported by the tarsnap client code) remain intact.

As always, I'm happy to answer any further questions.

Colin Percival
Security Officer Emeritus, FreeBSD | The power to serve
Founder, Tarsnap | www.tarsnap.com | Online backups for the truly paranoid
