
Fwd: Tarsnap outage



In case anyone here isn't subscribed to tarsnap-announce (if you're not, you
should be):

-------- Original Message --------
Subject: Tarsnap outage
Date: Fri, 29 Jun 2012 23:03:37 -0700
From: Colin Percival <cperciva@tarsnap.com>
To: tarsnap-announce@tarsnap.com

The important information first: All data stored on Tarsnap is safe, but you
can't access it right now.  I'm working to bring it back online.

The detailed story:

At approximately 2012-06-30 03:02 UTC there was a power outage affecting one
of the Availability Zones in Amazon's EC2 US-East-1 region.  Amazon reports
that this was related to severe electrical storms in the area; it's not clear
why their (multiple) backup power systems did not function or whether this is
related to the cascading power failure which affected a different Availability
Zone in EC2 US-East-1 on 2012-06-15.

The main Tarsnap server was in the Availability Zone affected by this outage,
and was knocked offline as a result.

When power was restored, I found that the power outage had resulted in some
inconsistencies in the filesystem used by the Tarsnap server code for metadata
caching.  The way the Tarsnap server code works, the "one true version" of all
service state is stored in Amazon S3, but some information is cached locally
(since going out to S3 for everything would be too slow); this cache metadata
needs to be in place in order for the Tarsnap service to handle requests.

While I cannot see any clear evidence that any filesystem corruption has
affected the metadata cache -- and I think it is in fact quite unlikely -- I
can't rule out the possibility entirely; so out of an abundance of caution I
decided to regenerate the cached metadata from the "known good" data on Amazon
S3.  This has the effect of prolonging the Tarsnap outage, which I realize is
inconvenient, but I feel that my first priority must be to ensure that there
is no possibility of bringing the service back online in a state which could
result in any data loss.
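
As an illustration only, the cache-and-rebuild arrangement described above
might be sketched as follows.  This is not the Tarsnap server code: the
MetadataCache class, the rebuild() and size_of() methods, and the use of a
plain dictionary standing in for Amazon S3 are all hypothetical, and the
"metadata" here is just each blob's length.

```python
# Hypothetical sketch; not the actual Tarsnap server code.
from typing import Dict


class MetadataCache:
    """Local cache of metadata derived from an authoritative object store."""

    def __init__(self, store: Dict[str, bytes]) -> None:
        # `store` stands in for the authoritative copy (S3 in the message).
        self.store = store
        self.cache: Dict[str, int] = {}
        # Requests are refused until the cache has been rebuilt.
        self.ready = False

    def rebuild(self) -> None:
        """Discard whatever is cached locally and regenerate it from the store."""
        self.ready = False
        # Assumed metadata: the size of each stored blob.
        fresh = {key: len(blob) for key, blob in self.store.items()}
        self.cache = fresh
        self.ready = True

    def size_of(self, key: str) -> int:
        """Answer a request from the cache; fail if the cache is not in place."""
        if not self.ready:
            raise RuntimeError("metadata cache not rebuilt yet")
        return self.cache[key]
```

In this sketch a possibly damaged cache is never trusted: rebuild() throws
it away and derives every entry from the authoritative store, at the cost of
the service being unavailable until the rebuild finishes.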

Obviously this is not a good situation -- an EC2 outage which affects Tarsnap
is bad, and a Tarsnap outage which continues even after EC2 is fixed is much
worse.  I've been reworking the Tarsnap server code to remove this
sort of single point of failure -- some of you may have seen my "kivaloo" NoSQL
data store which I'm building for this purpose -- but unfortunately I haven't
gotten to the point of having that code in production yet.

Sorry about the headaches,
-- 
Colin Percival
Security Officer Emeritus, FreeBSD | The power to serve
Founder, Tarsnap | www.tarsnap.com | Online backups for the truly paranoid