
2023-07-02 -- 2023-07-03 Tarsnap outage post-mortem



I promised a post-mortem three weeks ago after I brought the Tarsnap service
back online.  It took me an unforgivably long time to get around to writing
this, but here it is.

At approximately 2023-07-02 13:07:58 UTC, the central Tarsnap server (hosted
in Amazon's EC2 us-east-1 region) went offline, suffering a "failed system
status check".  Since the server is a virtual machine, this could mean many
things, including a power outage, a hardware failure on the physical host,
or an outage in EC2's network; all I can say is that since I haven't seen
reports of any widespread EC2 issues at the same time, it was most likely
just an isolated hardware fault.

Tarsnap's monitoring systems detected the failure at 2023-07-02 13:10 UTC (I
have monitoring which writes, reads, and deletes archives every 5 minutes
from a different EC2 region) and alerted me.  (The text message at 13:10 UTC
didn't wake me up, but the phone call at 13:15 UTC did.)  My initial
investigation didn't reveal any clearly transient errors, so I assumed the
system was dead and started spinning up a replacement EC2 instance.

At approximately 2023-07-02 13:52 UTC (45 minutes after the initial outage),
Amazon restarted the failed server on a new EC2 instance; this brought up the
operating system (FreeBSD) but did not restart the Tarsnap server code, since
I don't have the system configured to start that automatically -- if anything
causes the system to reboot unexpectedly, I want to check things over before
any tarsnap traffic gets handled, since "preventing data loss if something
breaks" is far more important than "maximizing service availability".

The server logs after rebooting showed filesystem corruption; it's clear that
whatever took the system offline either killed the hardware or severed it from
the Elastic Block Store which holds its filesystems.  I decided to continue
with setting up a new server rather than attempting to recover the old one.

The Tarsnap service stores data in Amazon S3 as a log-structured filesystem,
with each S3 object consisting of a header holding metadata for all of its
log entries, followed (optionally) by data for those log entries.  For
example, a "start write transaction" log entry has a header identifying the
machine and transaction nonce but has no log data, while a "store data block"
log entry has a header identifying the machine and block name, along with the
block data.  Under normal conditions, the log entry metadata is also cached
in EC2 and is never read from Amazon S3; the only reads from Amazon S3 are
for block data requested by the tarsnap client.
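
To make the storage format a bit more concrete, here is a rough sketch of
what a log entry header might look like.  The real server structures aren't
public, so the types, field names, and sizes below are purely illustrative:

    /* Illustrative sketch only -- not the actual Tarsnap server code. */
    #include <stdint.h>

    enum logentry_type {
        LOG_TXN_START,          /* Start write transaction; no payload. */
        LOG_BLOCK_STORE,        /* Store a data block; payload follows. */
        LOG_BLOCK_DELETE,       /* Delete a data block; no payload. */
        LOG_MACHINE_REGISTER    /* (Re)register a machine to an account. */
    };

    struct logentry_hdr {
        uint64_t seqnum;        /* Sequence number; fixes replay order. */
        uint64_t machinenum;    /* Machine this entry belongs to. */
        uint32_t type;          /* One of enum logentry_type. */
        uint8_t  name[32];      /* Block name or transaction nonce. */
        uint64_t datalen;       /* Bytes of payload (0 if none). */
    };

    /*
     * An S3 object holds the headers for all of its log entries, followed
     * by the (optional) data payloads; in normal operation the headers
     * are cached in EC2 and S3 is only read for block data requested by
     * the tarsnap client.
     */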

The process of recovering the EC2 instance state consists of two steps:
first, reading all of the metadata headers from S3; and second, "replaying"
all of those operations locally.  (These cannot be performed at the same
time, since the use of log-structured storage means that log entries are
"rewritten" to free up storage when data is deleted; log entries contain
sequence numbers so that they can be replayed in the correct order, but they
must be sorted into that order after being retrieved before the replay can
begin.)
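
Here is a rough sketch of that second step, reusing the hypothetical struct
logentry_hdr from above; apply_logentry is a stand-in for the real
(non-public) code which applies one entry to the local state:

    #include <stdint.h>
    #include <stdlib.h>

    /* Compare two log entry headers by sequence number, for qsort(3). */
    static int
    seqnum_cmp(const void *a, const void *b)
    {
        uint64_t sa = ((const struct logentry_hdr *)a)->seqnum;
        uint64_t sb = ((const struct logentry_hdr *)b)->seqnum;

        return ((sa > sb) - (sa < sb));
    }

    /* Apply one log entry to the local state (details elided). */
    static void
    apply_logentry(const struct logentry_hdr *hdr)
    {

        (void)hdr;
    }

    void
    replay_log(struct logentry_hdr *hdrs, size_t nhdrs)
    {

        /* Log cleaning rewrites entries, so the order in which they were
         * fetched from S3 is not the order in which they were written. */
        qsort(hdrs, nhdrs, sizeof(hdrs[0]), seqnum_cmp);

        for (size_t i = 0; i < nhdrs; i++)
            apply_logentry(&hdrs[i]);
    }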

The first step proceeded without incident, completing at 2023-07-03 01:49:49
UTC.  In hindsight it probably could have been faster: I had the recovery
process configured to make 250 simultaneous requests to Amazon S3 because that
is what Amazon S3 could sustain a decade ago, but I suspect that this could be
significantly increased now.

The second step failed almost immediately, with an error telling me that a
replayed log entry was recording data belonging to a machine which didn't
exist.  This provoked some head-scratching until I realized that the problem
was introduced by some code I wrote in 2014: Occasionally Tarsnap users need
to move a machine between accounts, and I handle this by storing a new
"machine registration" log entry and deleting the previous one.
Unfortunately, while I had tests for this, I never tested regenerating the
server state after a machine was "re-owned" *while it had data stored* -- and
since the new "machine registration" log entry has a higher sequence number,
the server was quite right in objecting that data was stored belonging to a
machine which didn't exist... yet.
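
Put in terms of the sketches above, the replay code effectively has a sanity
check along these lines (machine_is_registered and store_block are stand-ins
for the real, non-public helpers):

    #include <stdbool.h>
    #include <stdint.h>

    struct state;       /* Opaque reconstructed server state. */

    extern bool machine_is_registered(const struct state *, uint64_t);
    extern int store_block(struct state *, const struct logentry_hdr *);

    static int
    replay_store_block(struct state *st, const struct logentry_hdr *hdr)
    {

        /*
         * Seatbelt: data belonging to an unknown machine normally means
         * corruption.  But a machine moved between accounts gets a new
         * "machine registration" entry with a *higher* sequence number
         * than blocks it stored earlier, so a strictly ordered replay
         * trips this check.
         */
        if (!machine_is_registered(st, hdr->machinenum))
            return (-1);

        return (store_block(st, hdr));
    }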

Once I figured out what was going on, I disabled that "seatbelt" and resumed
the state reconstruction process.  It then failed again almost immediately,
complaining that it couldn't find data in Amazon S3, because my attempt at
"resuming" the process involved skipping the downloading-data-from-S3 step
and thereby left a "maximum log entry sequence number" value uninitialized
(and therefore set to zero).  A short time later I had that fixed and the
state reconstruction process continued properly.

This process was also somewhat slower than it should have been; had I
realized that it was disk-throughput-bound, I would have configured the
relevant EBS volume for higher throughput.  Unfortunately, by this point I
was quite sleep-deprived, so I wasn't monitoring the process closely --
otherwise I would have noticed this (in both gstat(8) and Amazon CloudWatch)
and reconfigured the EBS volume.

By about 2023-07-03 15:10 UTC (I didn't record the exact time) the state
reconstruction process had completed.  I ran some quick tests with the server
in read-only mode and compared against the old server state (to make sure that
things matched up aside from the old server's filesystem having lost the last
few seconds of data when the server failed), and then I brought it online.
The first live post-outage traffic was at 2023-07-03 15:25:58 UTC, roughly
26 hours and 18 minutes after the outage started.

Following my ill-defined "Tarsnap doesn't have an SLA but I'll give people
credits for outages when it seems fair" policy, on 2023-07-13 (after some dust
settled and I caught up on some sleep) I credited everyone's Tarsnap accounts
with 50% of a month's storage costs.

As always, feel free to email me if you have any questions.

--
Colin Percival
FreeBSD Deputy Release Engineer & EC2 platform maintainer
Founder, Tarsnap | www.tarsnap.com | Online backups for the truly paranoid
