Hi everyone,

I'm sure most of you are aware of the AWS outage yesterday, which started some time around 17:30 UTC and was mostly resolved around 20:45 UTC. (For more accurate times and details of what went wrong, wait for the post-mortem which I'm sure will be published by AWS in the days to come.) I'm sending this email to provide some additional details about how it affected Tarsnap.

The outage had no immediate effects on Tarsnap; however, around 18:00 (about half an hour after the outage started) a large proportion of Amazon SimpleDB requests started timing out.

Tarsnap uses SimpleDB for its accounting data; this is "legacy" code in that it's long overdue for replacement (it will be replaced with a system built on the kivaloo data store). When connections arrive at the Tarsnap service, queries are sent to SimpleDB to check the account balance of the relevant user, so that an error can be sent back for accounts which have run out of funds (or which never had funds added in the first place).

Due to previous SimpleDB-related glitches, the Tarsnap service treats errors reading the account balance permissively -- that is, requests are allowed unless it *knows* that the account balance is not positive. Unfortunately, the code in question still waits for a response from SimpleDB. As a result, when SimpleDB requests started timing out (without returning errors), the handling of incoming connections to the Tarsnap service became very slow, with many requests timing out.

After some careful re-reading of the relevant code (I wrote it 13 years ago and wanted to make sure I didn't break anything further) I determined that the safest workaround was to point the SimpleDB traffic at Amazon S3 -- which very reliably and quickly sent errors back. (My apologies to anyone at AWS who was confused to see SimpleDB requests landing at S3 nodes!) Once errors were being sent back quickly, the Tarsnap service reverted to its fail-safe behaviour of allowing requests through.
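
For readers who want to see the permissive balance check in concrete terms, here is a minimal Python sketch. It is an illustration only and not the Tarsnap server code: the function allow_request, the timeout parameter, and the stub lookups are all hypothetical names invented for this example. It shows the two properties described above: a request is denied only when the balance is known to be non-positive, and a lookup that fails fast is far cheaper to handle than one that hangs.

```python
import concurrent.futures
import time


def allow_request(fetch_balance, timeout_s):
    """Return True if the request should proceed (hypothetical helper).

    Fail-open policy: deny only when the balance is *known* to be
    non-positive. A lookup error or a timeout leaves the balance
    unknown, so the request is allowed.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fetch_balance)
    try:
        # Wait at most timeout_s for the accounting lookup.
        balance = future.result(timeout=timeout_s)
    except Exception:
        # The lookup raised an error or timed out: balance unknown.
        return True
    finally:
        # Do not block on a lookup that is still hanging.
        pool.shutdown(wait=False)
    return balance > 0


# Stub lookups standing in for the accounting database (assumptions).
def funded_account():
    return 5.0


def empty_account():
    return 0.0


def fast_error():
    raise ConnectionError("request rejected immediately")


def hanging_lookup():
    time.sleep(0.5)  # simulates a request that stalls instead of failing
    return 0.0


if __name__ == "__main__":
    for lookup in (funded_account, empty_account, fast_error, hanging_lookup):
        start = time.monotonic()
        allowed = allow_request(lookup, timeout_s=0.05)
        elapsed = time.monotonic() - start
        print(f"{lookup.__name__}: allowed={allowed} after {elapsed:.3f}s")
```

Without a bound on the wait, the hanging case would hold up the connection for as long as the lookup stalls, which is the slowdown described above; a lookup that errors out immediately falls through to the permissive path straight away.
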
I reverted this configuration change around 2:00 UTC, after confirming that SimpleDB was operating normally.

Attempts to add funds to Tarsnap accounts and/or to read recent account activity continued to experience errors until SimpleDB recovered at around 22:00 UTC, since the workaround I applied to the Tarsnap service did not help with the website.

For clarity: No data was ever lost; any archives successfully created (as reported by the tarsnap client code) remain intact.

As always, I'm happy to answer any further questions.

Sincerely,
-- 
Colin Percival
Security Officer Emeritus, FreeBSD | The power to serve
Founder, Tarsnap | www.tarsnap.com | Online backups for the truly paranoid