
Tarsnap outage 2016-07-24 10:15:19--11:40:04

At approximately 2016-07-24 10:15:19 UTC, the Tarsnap service went offline
due to a sequence of events culminating in a filesystem becoming full.  All
Tarsnap customer data remained safely stored for the duration, but archives
which were in the process of being created at that time would have failed
("too many network failures") and no further archives could be created or
retrieved until the service came back online 85 minutes later at approximately
2016-07-24 11:40:04 UTC.

While Tarsnap does not have any formal SLA, I have a (rather ill-defined)
policy of issuing credits to Tarsnap users affected by outages or bugs in
the Tarsnap code, based on my personal sense of fairness.  As a result, I
have credited Tarsnap users' accounts with 2 days of storage costs due to
this 85 minute outage.

Timeline of events (times in UTC)

Some time around June 30th: The Tarsnap server starts to experience an
increase in correlated timeout failures from Amazon S3.  While the overall
S3 request failure rate did not significantly increase, the rate of failures
of requests *which were retrying a failed request* increased substantially
[1], resulting in a roughly 100x increase in the rate at which requests
exhaust all their permitted retries and are considered to have failed.
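
To make the arithmetic concrete, here is a small sketch.  The numbers are
purely illustrative (the post does not state Tarsnap's actual failure rates
or retry limits); the point is that correlation between a failure and the
retry of that failure inflates the rate of retry exhaustion even when the
overall failure rate barely moves:

```python
# Illustrative numbers only, not Tarsnap's actual rates or retry limits.
p_first = 0.01      # chance that the first attempt at a request fails
retries = 4         # retries permitted after the first attempt

# Independent failures: every attempt fails with the same probability,
# so exhausting all retries requires 5 independent failures in a row.
p_exhaust_independent = p_first * (p_first ** retries)      # on the order of 1e-10

# Correlated failures: a retry of a failed request fails far more often
# than a fresh request does.
p_retry_fail = 0.5  # chance a retry fails, given the previous attempt failed
p_exhaust_correlated = p_first * (p_retry_fail ** retries)  # on the order of 1e-3
```

In both cases the first-attempt failure rate is the same 1%, but retry
exhaustion is several orders of magnitude more common in the correlated
case -- which is how a large jump in exhausted retries can appear without a
visible jump in the overall failure rate.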

2016-07-16 12:50:02: A routine background job kicks off which identifies
unreferenced blocks of data, records in S3 that those blocks are no longer
needed, then logs that change of status to disk.  (This log is used later
to remove mapping entries which tell the Tarsnap service where to find data
in S3.)

142 times over the following week: Writes to Amazon S3 fail repeatedly,
causing the aforementioned background job to abort.  Each time, it
automatically restarts, replays the list of no-longer-needed blocks, and
appends those blocks to the log again.

2016-07-24 10:15:19: The log of blocks being freed, which now lists many
blocks over a hundred times each, fills the filesystem on which it is stored.
This causes other filesystem writes to fail, and since this is a "never
event", it prompts an immediate shutdown of all the Tarsnap service code.
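
The growth mechanism can be sketched as follows.  This is a hypothetical
shape for the job, not Tarsnap's actual code; it contrasts a restart that
re-logs everything with one that remembers what it has already logged:

```python
def restart_and_relog(blocks, attempts):
    """Hypothetical shape of the failure mode: each restart replays the
    whole list of no-longer-needed blocks and appends every one of them
    to the log again, so the log grows by len(blocks) per attempt."""
    log = []
    for _ in range(attempts):
        log.extend(blocks)      # same blocks re-logged on every restart
    return log

def restart_idempotent(blocks, attempts):
    """A restart that remembers which blocks it has already logged
    appends each block at most once, so the log size is bounded by the
    number of distinct blocks no matter how often the job aborts."""
    log, seen = [], set()
    for _ in range(attempts):
        for b in blocks:
            if b not in seen:
                seen.add(b)
                log.append(b)
    return log

blocks = list(range(1000))
assert len(restart_and_relog(blocks, 142)) == 142000   # unbounded growth
assert len(restart_idempotent(blocks, 142)) == 1000    # bounded
```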

2016-07-24 10:15:44: Tarsnap's monitoring systems determine that the Tarsnap
service is not operating properly, and send me a text message.

2016-07-24 10:20:30: Tarsnap's monitoring systems determine that the Tarsnap
service is still not operating properly, and phone me.

2016-07-24 11:20:30: Tarsnap's monitoring systems determine that the Tarsnap
service is still not operating properly, and phone me again.  This time I wake
up, and upon checking my phone discover much to my surprise that, despite my
phone being about 50cm away from me with the volume turned all the way up, I
had somehow managed to sleep through the first text message and phone call.

Over the next 20 minutes: I SSH into the server, see that the server code
isn't running, and consult its error log to discover why; I notice the runaway
log file and re-read the relevant code to convince myself that I'm correctly
remembering how it works; and I delete that file, knowing that it will be
safely regenerated when the background free-block-recording process next runs.

2016-07-24 11:40:04: I restart the server code which had aborted, and confirm
that it's all working properly again.

2016-07-24 11:40:59: Tarsnap monitoring shows everything functioning normally.

2016-07-24 12:09:46: I send a quick email to the tarsnap-users list saying
that there was an outage and promising a full post-mortem to follow (this one,
although it took longer to write than the "later today" I initially expected).

Lessons learned

1. I am apparently able to sleep through my phone ringing, at least once.  I
had assumed this couldn't happen, so I had set the Tarsnap monitoring system
to phone at most once an hour, since I knew I wouldn't want to be bothered by
repeated phone calls once I was already working on resolving an issue.

2. While external "behave like a real user" monitoring remains the gold
standard, better monitoring of internal metrics -- for example, the amount of
space free on a filesystem, which gradually diminished over the course of a
week -- could have alerted me to a problem before it caused an outage. [2]
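
As a sketch of the kind of check that could have caught this, using Python's
standard library and an arbitrary illustrative 10% threshold:

```python
import shutil

def check_free_space(path, min_free_fraction=0.10):
    """Return (ok, free_fraction) for the filesystem holding `path`.
    The 10% threshold is an arbitrary illustrative value; a real alert
    would likely also track the trend over time, since in this incident
    free space diminished gradually over the course of a week."""
    usage = shutil.disk_usage(path)
    free_fraction = usage.free / usage.total
    return free_fraction >= min_free_fraction, free_fraction

ok, frac = check_free_space("/")
```

Of course, the check itself is trivial; as [2] notes, the hard part is
conveying its result out of a high-security system.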

3. Anomalous behaviour should be investigated, even if it is believed to be
harmless.  Earlier in July I noticed that there had been an increase in the
fraction of S3 PUTs repeatedly failing; but since it did not seem to be
having any significant effect on overall system performance, and I knew that
all of the relevant code would retry failures, I did not investigate the root
cause.  While I don't know if I would have been able to uncover the reason
behind the anomalous behaviour from Amazon S3, the mere act of investigating
might have made me more alert to possible complications.

4. Post-mortems can take much longer to write than anticipated.  I hope I
never again complain that an Amazon Web Services post-mortem is taking too
long to arrive...

Final words

While the ultimate precipitating factor seems to have been an as-yet
undiagnosed change in Amazon S3 behaviour, the majority of the fault here
lies in Tarsnap: Inappropriate handling of repeated S3 failures resulted in
runaway disk space consumption; a lack of sufficient internal monitoring
allowed it to go undetected until it became critical; and an inadequate
understanding of the human factor -- that is, my ability to wake up when my
phone rang -- resulted in the outage continuing for three times longer than
it would have otherwise.  We can do better; and I will be specifically keeping
an eye out to ensure that this particular pattern of events does not recur.

Nevertheless, I'm happy that the intended response to this failure mode --
"something weird happened; shut down all the things" -- ran exactly as I
hoped.  While the
Tarsnap service was unavailable, both for storing archives and for retrieving
them, for a period of 85 minutes, shutting down the service until I could
diagnose and fix the problem ensured that no user data was lost; I firmly
believe that for a data backup service, avoiding any risk of data loss must
always take priority over maintaining service availability.


[1] It is not yet clear what the root cause of this odd Amazon S3 behaviour
is.  With some additional logging which I've added in the past few
days, it has become clear that there is an internal failure mode resulting
in InternalError responses being generated after approximately 4.88 seconds;
my observations thus far indicate that if three consecutive attempts yield
this same failure mode, there is a 100% chance that subsequent retries of the
same request will fail in the same way.  Since Amazon S3 documentation states
that failures should be retried and makes no mention of a possibility that
retries will *never* succeed, I presume that this is not intended behaviour.
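
A client could exploit this observation by giving up early instead of
retrying a request that will never succeed.  The sketch below uses a
hypothetical interface, not Tarsnap's actual retry code:

```python
def put_with_poison_detection(attempt_put, max_retries=10,
                              consecutive_limit=3):
    """Hypothetical retry wrapper: attempt_put() returns None on success
    or an error-class string on failure.  If `consecutive_limit`
    consecutive attempts fail the same way -- which the observation
    above suggests means retries will never succeed -- stop retrying
    and report a permanent failure so it can be escalated instead."""
    last_err, streak = None, 0
    for _ in range(max_retries):
        err = attempt_put()
        if err is None:
            return "success"
        streak = streak + 1 if err == last_err else 1
        last_err = err
        if streak >= consecutive_limit:
            return "permanent-failure"
    return "retries-exhausted"

# A request which always yields InternalError stops after 3 attempts:
outcome = put_with_poison_detection(lambda: "InternalError")
assert outcome == "permanent-failure"
```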

[2] Internal monitoring does however come with a risk: In order for it to be
useful, information must be conveyed out of the system being monitored.  This
implies the presence of additional code running on the system and a channel
for the information to be conveyed -- neither of which is desirable in a high
security environment.  I will have to carefully consider the tradeoffs here
and whether I can ensure adequate security before implementing any such new
monitoring mechanisms.

Colin Percival
Security Officer Emeritus, FreeBSD | The power to serve
Founder, Tarsnap | www.tarsnap.com | Online backups for the truly paranoid
