
Tarsnap outage post-mortem



At approximately 2012-06-30 03:02 UTC, the central Tarsnap server (hosted in
Amazon's EC2 US-East region) went offline due to a power outage.  According
to Amazon's post-mortem of this incident [1], the outage was caused by two generator
banks independently failing to provide stable voltages after utility (grid)
power was lost during a severe electrical storm.

When the Tarsnap EC2 instance came back online about an hour later, I found
that the abrupt loss of power had resulted in filesystem corruption.  As I
explained in a blog post in December 2008 [2], Tarsnap stores all user data
in Amazon S3, but keeps metadata cached in EC2.  While I could not see any
evidence that the power loss had resulted in this cached metadata being
corrupted, I could not absolutely rule out the possibility -- so keeping in
mind that the first responsibility of a backup system is to avoid any
possibility of data loss or corruption, I decided to err on the side of
caution by treating the state on EC2 as "untrustworthy".

As I described in my December 2008 blog post, losing all state on EC2 is a
failure mode which Tarsnap is designed to survive, by reading data back from
S3 and replaying all of the operation log entries.  That said, losing local
state and needing to recover completely from off-site backups (which Amazon
S3 counts as, since it's replicated to multiple datacenters) is very much a
worst-case scenario.  It's also, quite literally, a nightmare scenario: A
few days before this outage, I had a nightmare about this happening, probably
provoked by the 2012-06-15 power outage which affected instances in a different
EC2 availability zone.
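
To make that concrete without getting into Tarsnap's actual on-disk formats,
here is a deliberately simplified sketch of the general pattern: the durable
truth (data blocks plus an append-only operation log) lives in S3, the local
metadata is merely derived from it, and recovery means replaying the log to
re-derive that metadata.  All of the key names and formats below are invented
for illustration.

    # Deliberately simplified sketch of log-replay recovery (not Tarsnap's
    # actual code or formats).  The durable truth lives in "S3" (here, a
    # plain dict); the local metadata cache is derived state, so it can
    # always be rebuilt.
    s3 = {
        "log/000001": ("create", "machine1", "archive-2012-06-29"),
        "log/000002": ("create", "machine1", "archive-2012-06-30"),
        "log/000003": ("delete", "machine1", "archive-2012-06-29"),
    }

    def rebuild_metadata(s3_objects):
        """Replay operation-log entries, in order, to rebuild the cache."""
        cache = {}
        for key in sorted(k for k in s3_objects if k.startswith("log/")):
            op, machine, archive = s3_objects[key]
            if op == "create":
                cache[(machine, archive)] = True
            elif op == "delete":
                cache.pop((machine, archive), None)
        return cache

    print(rebuild_metadata(s3))   # {('machine1', 'archive-2012-06-30'): True}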

I restored Tarsnap to a fully operational state at 2012-07-01 22:50 UTC,
slightly less than 44 hours after the outage started.  Obviously any outage
is bad and an outage of this length is unacceptable; over the course of the
recovery process I learned several lessons which should make recovery from
any future "complete metadata loss" incidents faster (see below).

While Tarsnap does not have any formal SLA, I have a (rather ill-defined)
policy of issuing credits to Tarsnap users affected by outages or bugs in the
Tarsnap code, based on my personal sense of fairness.  While the original cause
of this outage was out of my control, the outage should have been much shorter,
and as a result I have credited Tarsnap users' accounts with 50% of a month of
storage costs.

Timeline of events
------------------

(Some of the times below are approximate; I was not worrying about keeping notes
as this process was underway, so I've had to reconstruct the timeline based on
my memory, log files, and file timestamps.)

2012-06-30 03:02 UTC: Power is lost to the Tarsnap server, and I observe it
ceasing to respond to network traffic.  Because of two recent outages in the
EC2 US-East region -- the power outage on June 15th, after which Amazon wrote
in a post-mortem that "We have also completed an audit of all our back-up power
distribution circuits", and a network outage earlier on June 30th (both of
which were far more limited in scope, and neither of which affected Tarsnap) --
my initial presumption is that this outage is caused by a network issue.

2012-06-30 03:21 UTC: My presumption that the outage was network-related is
reinforced by Amazon posting to its status page [3] that they were "investigating
connectivity issues".

2012-06-30 03:40 UTC: Amazon posts to its status page that "a large number of
instances [...] have lost power".  I start attempting to launch a replacement
EC2 instance in case it becomes necessary to perform a full state recovery from
S3 (as eventually proved to be the case), but my attempts fail due to EC2 APIs
being offline as a result of the power outage.

2012-06-30 04:03 UTC: Power is restored to the Tarsnap server, and I SSH in to
find that it has suffered filesystem corruption as a result of the power outage.
Since I cannot rule out the possibility that the local state was corrupted, I
continue with plans to perform a full state recovery.

2012-06-30 04:37 UTC: I succeed in launching a replacement Tarsnap server and
start configuring it and installing the Tarsnap server code.  This process
includes creating and attaching Elastic Block Store disks using the AWS
Management Console, which is slowed down by repeated timeouts and errors.
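
In hindsight, this is one of the places where scripting against the EC2 API,
rather than clicking through the Console, would have helped (see lesson 2
below).  Purely as an illustrative sketch -- assuming the boto library's EC2
interface, and with the volume size, zone, instance ID, device name, and retry
counts below being placeholders -- volume provisioning could be scripted
roughly as follows:

    # Rough sketch: provision and attach an EBS volume via the API instead
    # of the Console.  Assumes the boto library; all sizes, IDs and device
    # names are placeholders.
    import time
    import boto.ec2

    def with_retries(fn, attempts=10, delay=5):
        """Retry an API call, since the control plane may be overloaded
        ("too many requests") during a large outage."""
        for i in range(attempts):
            try:
                return fn()
            except Exception as e:
                if i == attempts - 1:
                    raise
                print("API call failed (%s); retrying in %ds" % (e, delay))
                time.sleep(delay)

    conn = boto.ec2.connect_to_region("us-east-1")
    vol = with_retries(lambda: conn.create_volume(100, "us-east-1a"))
    while vol.status != "available":    # wait until the volume is usable
        time.sleep(5)
        vol.update()
    with_retries(lambda: conn.attach_volume(vol.id, "i-12345678", "/dev/sdf"))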

2012-06-30 05:25 UTC: I finish configuring the replacement Tarsnap server and
start the process of regenerating its local state from S3.  The first phase of
this process involves reading millions of stored S3 objects; unfortunately,
the code performs these reads in sequential order, triggering a worst-case
performance behaviour in S3 [4].  As a result, this phase of recovery took much
longer than I had anticipated, and the design of the code meant that changing
the order in which objects were read was not something I could do "on the
fly".

2012-06-30 15:00 UTC: After inspecting the "outaged" Tarsnap server, I conclude
that the state corruption is almost certainly limited to archives committed in
the last few seconds before the power loss.  Consequently, I bring the server
back online in a read-only mode, so that anyone who needs their data urgently
can retrieve it before the full recovery process is complete.

2012-07-01 01:45 UTC: The first phase of Tarsnap recovery -- retrieving bits
from S3 -- completes, and the second phase -- reconstructing a "map" database
which identifies the location of each block of data within S3 -- starts.
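
The map database is internal to the Tarsnap server, so the following is only a
hypothetical sketch of the general idea -- each stored S3 object contains many
blocks, and the map records, for every block, which object it lives in and at
what offset; the object format shown here is invented for illustration:

    # Hypothetical sketch of rebuilding a block "map": scan every stored S3
    # object and record where each block lives.  The object layout here is
    # invented; the real format is internal to Tarsnap.
    def scan_object(data):
        """Yield (block_id, offset, length) for each block in one object."""
        offset = 0
        while offset < len(data):
            block_id = data[offset:offset + 32]             # 32-byte block ID
            length = int.from_bytes(data[offset + 32:offset + 36], "big")
            yield block_id, offset + 36, length
            offset += 36 + length

    def rebuild_map(objects):
        """objects: iterable of (s3_key, raw_bytes) pairs."""
        block_map = {}
        for key, data in objects:
            for block_id, offset, length in scan_object(data):
                block_map[block_id] = (key, offset, length)
        return block_map

    # Example: one object holding a single 5-byte block.
    obj = b"x" * 32 + (5).to_bytes(4, "big") + b"hello"
    print(rebuild_map([("objects/000001", obj)]))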

2012-07-01 02:30 UTC: I notice that the map database reconstruction is running
anomalously slowly.  This turns out to be caused by some I/O rate-limiting code
I had put in place to prevent back-end processes from starving the front-end
Tarsnap daemon (which has much stricter scheduling requirements) of I/O --
obviously, those limits are not necessary during the reconstruction stage, when
there is no front-end daemon running.

2012-07-01 02:58 UTC: I restart the second phase of Tarsnap recovery (without
I/O rate limits in place).
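
This is really lesson 5 below in miniature: behaviour which is right during
normal operation is wrong during recovery.  As a rough sketch of the idea -- a
token-bucket style limiter with a recovery-mode bypass, none of which is
Tarsnap's actual code:

    # Rough sketch: a token-bucket I/O rate limiter which can be bypassed
    # when the server is rebuilding state and there is no front-end daemon
    # competing for the disk.  Entirely illustrative.
    import time

    class IOLimiter:
        def __init__(self, bytes_per_sec, recovery_mode=False):
            self.rate = bytes_per_sec
            self.recovery_mode = recovery_mode
            self.tokens = bytes_per_sec
            self.last = time.time()

        def throttle(self, nbytes):
            """Sleep as needed to keep back-end I/O under the configured rate."""
            if self.recovery_mode:
                return              # no front-end daemon to protect
            now = time.time()
            self.tokens = min(self.rate,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if nbytes > self.tokens:
                time.sleep((nbytes - self.tokens) / self.rate)
                self.tokens = 0
            else:
                self.tokens -= nbytes

    limiter = IOLimiter(10 * 1024 * 1024, recovery_mode=True)
    limiter.throttle(1 << 20)   # returns immediately while in recovery mode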

2012-07-01 11:18 UTC: The third (final) phase of Tarsnap recovery -- replaying
logs to reconstruct the server-side cached state for each machine -- starts.

2012-07-01 12:41 UTC: The third phase of Tarsnap recovery fails due to requests
to S3 timing out.  Normally the Tarsnap code retries failed S3 requests, but in
this particular code path that functionality was missing.  I increase the S3
timeouts by a factor of 10 and restart the third recovery phase.
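
The missing functionality was just the usual retry loop.  As a generic sketch
of the sort of wrapper involved -- the timeout, retry count, and request
callback here are placeholders rather than Tarsnap's actual parameters:

    # Generic sketch of an S3 request wrapper: retry with exponential backoff
    # instead of letting one slow request abort a recovery phase which has
    # been running for hours.  All names and values are placeholders.
    import time

    class RequestTimeout(Exception):
        pass

    def s3_request_with_retries(do_request, attempts=8, timeout=30.0):
        """do_request(timeout) performs one S3 request, raising
        RequestTimeout (or another transient error) on failure."""
        delay = 1.0
        for attempt in range(attempts):
            try:
                return do_request(timeout)
            except RequestTimeout:
                if attempt == attempts - 1:
                    raise
                time.sleep(delay)
                delay *= 2      # exponential backoff between retries

    # Trivial demonstration with a request which always "succeeds".
    print(s3_request_with_retries(lambda timeout: "ok"))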

2012-07-01 21:59 UTC: The third phase of Tarsnap recovery completes.  I look
over the reconstructed state (and compare it against the mostly-correct state
on the outaged server) to confirm that the reconstruction worked properly; then
start the Tarsnap server code and run some tests.

2012-07-01 22:50 UTC: I'm satisfied that everything is working properly, and
switch the Tarsnap server's "Elastic IP" address over to point at the new
server instance.

Lessons learned
---------------

1. Test disaster recovery processes *at scale*.  I have always tested the
process for recovering from state stored on S3 every time I prepare to roll out
new code -- it's part of my "make test" -- but I have always done this on a
small data set.  A test with six orders of magnitude less data may help to
confirm that a recovery process works, but it certainly doesn't test that the
process works *quickly*; and both success and speed are important.  (A rough
sketch of a scale-parameterised recovery drill appears after this list.)

2. Disaster recovery processes should not rely on the AWS Management Console.
According to Amazon [1] the EC2 and EBS control plane was restored by 2012-06-30
04:10 UTC, but I was experiencing timeouts and errors (most commonly "too many
requests") from the Console over an hour beyond that point.  Obviously when a
disaster occurs there is a large influx of users into the Management Console;
evidently it needs work to improve its scalability.

3. Disaster recovery processes should start at the first sign of an outage.  In
this particular case it wouldn't have made any difference, since for the first
hour after the outage began it was impossible to launch a replacement EC2
instance; but that is an EC2 failure mode which Amazon says they are addressing,
so in the future, starting the recovery immediately -- rather than waiting to
find out whether the outage is a transient network issue or something more
severe -- could save time.

4. Sequential accesses to Amazon S3 are bad.  This isn't so much a "lesson
learned" as a "lesson I remembered applies here" -- I was aware of it but hadn't
realized that it would slow down the recovery process so much.

5. The appropriate behaviours during state recovery are not always the same as
the appropriate behaviours during normal operations.  Disk I/O rate limiting
of back-end processes and filesystem syncing are normally useful, but not so
much when it is important to recover state and get back online as quickly as
possible.  Code shared between "normal" and "recovery" operations -- which is,
in fact, most of the Tarsnap server code -- should be aware of which mode it's
running in and control those behaviours appropriately.

6. Tarsnap users are amazing people.  (Ok, I knew this already, but still, this
episode reinforced it.)  I didn't see a single irate email, tweet, or comment on
IRC during the entire outage; people politely asked if there was an outage and
if it was related to the EC2 outage, and aside from that the only communications
I had were very positive, mostly thanking me for my efforts and status reports.
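
Expanding on lesson 1 above: the point is not merely to exercise the recovery
code path, but to exercise it against a data set whose size can be dialled up
towards production scale, and to time it.  A rough sketch of what such a drill
harness might look like (the object count, environment variable, and replay
function are placeholders, not Tarsnap's code):

    # Rough sketch of a scale-parameterised recovery drill: generate N
    # synthetic log entries, replay them through the same code path a small
    # "make test" run uses, and report the elapsed time so regressions in
    # recovery *speed* are visible.  DRILL_OBJECTS and replay_log() are
    # placeholders.
    import os
    import time

    NOBJECTS = int(os.environ.get("DRILL_OBJECTS", "1000"))

    def synthetic_log(n):
        for i in range(n):
            yield ("create", "machine%d" % (i % 100), "archive%08d" % i)

    def replay_log(entries):
        cache = {}
        for op, machine, archive in entries:
            cache[(machine, archive)] = (op == "create")
        return cache

    start = time.time()
    replay_log(synthetic_log(NOBJECTS))
    elapsed = time.time() - start
    print("replayed %d entries in %.2fs (%.0f entries/s)"
          % (NOBJECTS, elapsed, NOBJECTS / max(elapsed, 1e-9)))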

Final words
-----------

After writing so much about Tarsnap, I'd like to take a moment to provide some
wider context to this outage.  The power outage which knocked Tarsnap offline
was big -- judging by Amazon's statement of "7% of instances" and the number of
IP addresses in the EC2 US-East region, somewhere around 50,000 instances went
offline -- and as I'm sure everybody reading this is aware, also affected such
"big names" as Netflix, Pinterest, Instagram, and Heroku.  Most people stop at
this point, but there's more to the story than that.

This power outage was caused by a "derecho" thunderstorm system [5] which is
believed to be one of the most severe non-hurricane storm systems in North
American history.  Twenty-two people are believed to have died as a direct
result of the storm, and over 3 million homes and businesses lost power.
Approximately a million people are still without power now, five days later.

Even worse, this storm system was caused by a heat wave which has set record
temperatures in hundreds of locations across the Eastern US, with many areas
exceeding 38 C (100 F) for several days in a row.  For the elderly, the young,
and individuals with chronic diseases, prolonged exposure to these temperatures
can be life-threatening -- and without electricity, air conditioning is not
available.  By the time this heat wave is over, it could easily be responsible
for hundreds or even thousands of deaths.

For many of us, a datacenter losing power is the only effect we will see from
this storm.  For most of the people who were directly affected by the storm,
it's the least of their worries.

References
----------

[1] http://aws.amazon.com/message/67457/
[2] http://www.daemonology.net/blog/2008-12-14-how-tarsnap-uses-aws.html
[3] http://status.aws.amazon.com/
[4] http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-seattle-hiring-event.html
[5] http://en.wikipedia.org/wiki/June_2012_North_American_derecho

-- 
Colin Percival
Security Officer Emeritus, FreeBSD | The power to serve
Founder, Tarsnap | www.tarsnap.com | Online backups for the truly paranoid