I promised a post-mortem three weeks ago after I brought the Tarsnap service back online. It took me an unforgivably long time to get around to writing this, but here it is.

At approximately 2023-07-02 13:07:58 UTC, the central Tarsnap server (hosted in Amazon's EC2 us-east-1 region) went offline, suffering a "failed system status check". As a virtual machine, this could mean many things, including a power outage, a hardware failure on the physical server, or an outage in EC2's network; all I can say is that, since I haven't seen reports of any widespread EC2 issues at the same time, it was most likely just an isolated hardware fault.

Tarsnap's monitoring systems detected the failure at 2023-07-02 13:10 UTC (I have monitoring writing/reading/deleting archives every 5 minutes from a different EC2 region) and alerted me. (The text message at 13:10 UTC didn't wake me up, but the phone call at 13:15 UTC did.) My initial investigation didn't reveal any clearly transient errors, so I assumed the system was dead and started spinning up a replacement EC2 instance.

At approximately 2023-07-02 13:52 UTC (45 minutes after the initial outage), Amazon restarted the failed server on a new EC2 instance; this brought up the operating system (FreeBSD) but did not restart the Tarsnap server code, since I don't have the system configured to start that automatically -- if anything causes the system to unexpectedly reboot, I want to check things over before any tarsnap traffic gets handled, since "preventing data loss if something breaks" is far more important than "maximizing service availability". The server logs after rebooting showed filesystem corruption; it's clear that whatever took the system offline either killed the hardware or severed it from the Elastic Block Store which holds its filesystems. I decided to continue with setting up a new server rather than attempting to recover the old one.

The Tarsnap service stores data in Amazon S3 as a log-structured filesystem, with each S3 object consisting of a header containing metadata for all of the log entries, followed by (optionally) data for the log entries. For example, a "start write transaction" log entry has a header identifying the machine and transaction nonce but has no log data, while a "store data block" log entry has a header identifying the machine and block name, along with the block data. Under normal conditions, the log entry metadata is also cached in EC2 and is never read from Amazon S3; the only reads from Amazon S3 are to fetch block data in response to requests from the tarsnap client.

The process of recovering the EC2 instance state consists of two steps: first, reading all of the metadata headers from S3; and second, "replaying" all of those operations locally. (These cannot be performed at the same time, since the use of log-structured storage means that log entries are "rewritten" to free up storage when data is deleted; log entries contain sequence numbers to allow them to be replayed in the correct order, but they must be sorted into that order after being retrieved before they can be replayed.)

The first step proceeded without incident, completing at 2023-07-03 01:49:49 UTC. In hindsight it probably could have been faster: I had the recovery process configured to make 250 simultaneous requests to Amazon S3 because that is what Amazon S3 could sustain a decade ago, but I suspect that this limit could be raised significantly now.
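To make the replay step concrete, here is a rough sketch in C of the shape of that second step; the struct fields and function names are illustrative only, not the actual Tarsnap server format or code.

/*
 * Illustrative sketch only -- these struct fields and function names are
 * assumptions for the sake of exposition, not Tarsnap's actual format or
 * code.  The point is the shape of step two: gather every log entry
 * header, sort by sequence number, then replay in order to rebuild the
 * local metadata state.
 */
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical metadata header for one log entry. */
struct logentry {
	uint64_t seqnum;	/* Position in the logical log. */
	uint8_t machine[32];	/* Machine this entry belongs to. */
	int type;		/* e.g. start-transaction, store-block. */
	size_t datalen;		/* 0 for metadata-only entries. */
};

/* Hypothetical hook which applies one entry to the local state. */
static void
apply_entry(const struct logentry *ent)
{

	(void)ent;	/* State update elided in this sketch. */
}

static int
compar_seqnum(const void *a, const void *b)
{
	uint64_t sa = ((const struct logentry *)a)->seqnum;
	uint64_t sb = ((const struct logentry *)b)->seqnum;

	return ((sa > sb) - (sa < sb));
}

/*
 * Entries come back from S3 in no particular order (and may have been
 * rewritten since they were first stored), so they must all be fetched
 * and sorted before any of them can be replayed.
 */
void
replay_log(struct logentry *entries, size_t nentries)
{
	size_t i;

	qsort(entries, nentries, sizeof(entries[0]), compar_seqnum);
	for (i = 0; i < nentries; i++)
		apply_entry(&entries[i]);
}

Replaying strictly in sequence number order like this is also what makes the failure described next possible: a machine's "store data block" entries can sort ahead of that machine's newest "machine registration" entry.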
The second step failed almost immediately, with an error telling me that a replayed log entry was recording data belonging to a machine which didn't exist. This provoked some head-scratching until I realized that the problem was introduced by some code I wrote in 2014: occasionally Tarsnap users need to move a machine between accounts, and I handle this by storing a new "machine registration" log entry and deleting the previous one. Unfortunately, while I had tests for this, I never tested regenerating the server state after a machine is "re-owned" *while having data stored* -- and since the new "machine registration" log entry has a higher sequence number, the server was quite right in objecting that data was stored belonging to a machine which didn't exist... yet.

Once I figured out what was going on, I disabled that "seatbelt" and resumed the state reconstruction process. It then failed again almost immediately, complaining that it couldn't find data in Amazon S3: my attempt at "resuming" the process involved skipping the downloading-data-from-S3 step, which left a "maximum log entry sequence number" value uninitialized (and therefore set to zero). A short time later I had that fixed, and the state reconstruction process continued properly.

This process was also somewhat slower than it should have been; had I realized that it was bound by disk throughput, I would have configured the relevant EBS volume for higher throughput. Unfortunately, by this point I was quite sleep-deprived and wasn't monitoring the process closely -- otherwise I would have noticed this (in both gstat(8) and Amazon CloudWatch) and reconfigured the EBS volume.

By about 2023-07-03 15:10 UTC (I didn't record the exact time) the state reconstruction process had completed. I ran some quick tests with the server in read-only mode and compared against the old server state (to make sure that things matched up, aside from the old server's filesystem having lost the last few seconds of data when the server failed), and then I brought it online. The first live post-outage traffic arrived at 2023-07-03 15:25:58 UTC, roughly 26 hours and 18 minutes after the outage started.

Following my ill-defined "Tarsnap doesn't have an SLA, but I'll give people credits for outages when it seems fair" policy, on 2023-07-13 (after some dust settled and I caught up on some sleep) I credited everyone's Tarsnap accounts with 50% of a month's storage costs.

As always, feel free to email me if you have any questions.

-- 
Colin Percival
FreeBSD Deputy Release Engineer & EC2 platform maintainer
Founder, Tarsnap | www.tarsnap.com | Online backups for the truly paranoid