
Tarsnap outage 2016-07-24 10:15:19--11:40:04



At approximately 2016-07-24 10:15:19 UTC, the Tarsnap service went offline
due to a sequence of events culminating in a filesystem becoming full.  All
Tarsnap customer data remained safely stored for the duration, but archives
which were in the process of being created at that time would have failed
("too many network failures") and no further archives could be created or
retrieved until the service came back online 85 minutes later at approximately
2016-07-24 11:40:04 UTC.

While Tarsnap does not have any formal SLA, I have a (rather ill-defined)
policy of issuing credits to Tarsnap users affected by outages or bugs in
the Tarsnap code, based on my personal sense of fairness.  As a result, I
have credited Tarsnap users' accounts with 2 days of storage costs due to
this 85 minute outage.

Timeline of events (times in UTC)
---------------------------------

Some time around June 30th: The Tarsnap server starts to experience an
increase in correlated timeout failures from Amazon S3.  While the overall
S3 request failure rate did not significantly increase, the rate of failures
of requests *which were retrying a failed request* increased substantially
[1], resulting in a roughly 100x increase in the rate at which requests
exhaust all their permitted retries and are considered to have failed.
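
A sketch of the kind of retry logic involved (the names and the retry
limit here are invented for illustration; this is not the actual Tarsnap
client code): if failures are independent, exhausting every permitted
retry is vanishingly rare, but if failures are correlated, a single
unlucky request can burn through its entire retry budget.

    /*
     * Illustrative sketch only: the names and the retry limit here are
     * invented for this example and are not taken from the Tarsnap code.
     */
    #include <stdio.h>
    #include <unistd.h>

    #define MAX_TRIES 5

    /* Stand-in for one S3 request: 0 = success, -1 = failure. */
    static int
    s3_request_once(void)
    {

        /* Simulate a persistent (correlated) failure mode. */
        return (-1);
    }

    /* Retry a request up to MAX_TRIES times with exponential backoff. */
    static int
    s3_request(void)
    {
        int try;

        for (try = 0; try < MAX_TRIES; try++) {
            if (s3_request_once() == 0)
                return (0);

            /* Back off before retrying: 1s, 2s, 4s, ... */
            sleep(1U << try);
        }

        /*
         * All permitted retries have been exhausted and the request is
         * reported as failed.  Correlated failures make reaching this
         * point far more common than the overall failure rate suggests.
         */
        return (-1);
    }

    int
    main(void)
    {

        if (s3_request())
            fprintf(stderr, "request failed after %d tries\n", MAX_TRIES);
        return (0);
    }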

2016-07-16 12:50:02: A routine background job kicks off which identifies
unreferenced blocks of data, records in S3 that those blocks are no longer
needed, then logs that change of status to disk.  (This log is used later
to remove mapping entries which tell the Tarsnap service where to find data
in S3.)

142 times over the following week: Writes to Amazon S3 fail repeatedly,
causing the aforementioned background job to abort.  It automatically
restarts.
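
Taken together with the previous entry, the pattern was roughly as
follows (sketched here with invented names; this is not the actual
Tarsnap code): each attempt appends records to the on-disk log as it
marks blocks unneeded, and an attempt which aborts partway through is
later restarted from the beginning, so the same blocks can be logged
again on every attempt.

    /*
     * Sketch of the failure pattern, with invented names; this is not
     * the actual Tarsnap code.  Each attempt appends to the log, and a
     * failed attempt is restarted from scratch, so the same blocks are
     * written to the log over and over.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NBLOCKS 200

    /* Stand-in for recording in S3 that a block is no longer needed. */
    static int
    s3_mark_unneeded(uint64_t blkno)
    {

        /* Pretend S3 writes start failing partway through the run. */
        return ((blkno < 100) ? 0 : -1);
    }

    /* One attempt at the background job; returns -1 if it aborts. */
    static int
    free_blocks_once(FILE *logf)
    {
        uint64_t blkno;

        for (blkno = 0; blkno < NBLOCKS; blkno++) {
            if (s3_mark_unneeded(blkno))
                return (-1);    /* Abort; we will be restarted. */

            /* Log the change of status to disk. */
            fprintf(logf, "block %llu freed\n",
                (unsigned long long)blkno);
        }
        return (0);
    }

    int
    main(void)
    {
        FILE *logf;
        int attempt;

        if ((logf = fopen("freed-blocks.log", "a")) == NULL)
            exit(1);

        /* 142 aborted attempts leave 142 copies of the same records. */
        for (attempt = 0; attempt < 142; attempt++)
            if (free_blocks_once(logf) == 0)
                break;

        fclose(logf);
        return (0);
    }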

2016-07-24 10:15:19: The log of blocks being freed, which now lists many
blocks over a hundred times each, fills the filesystem on which it is stored.
This causes other filesystem writes to fail, and since this is a "never
event", it prompts an immediate shutdown of all the Tarsnap service code.

2016-07-24 10:15:44: Tarsnap's monitoring systems determine that the Tarsnap
service is not operating properly, and send me a text message.

2016-07-24 10:20:30: Tarsnap's monitoring systems determine that the Tarsnap
service is still not operating properly, and phone me.

2016-07-24 11:20:30: Tarsnap's monitoring systems determine that the Tarsnap
service is still not operating properly, and phone me again.  This time I wake
up, and upon checking my phone discover much to my surprise that, despite my
phone being about 50cm away from me with the volume turned all the way up, I
had somehow managed to sleep through the first text message and phone call.

Over the next 20 minutes: I SSH into the server, see that the server code
isn't running, and consult its error log to discover why; I notice the runaway
log file and re-read the relevant code to convince myself that I'm correctly
remembering how it works; and I delete that file, knowing that it will be
safely regenerated when the background free-block-recording process is next
run.

2016-07-24 11:40:04: I restart the server code which had aborted, and confirm
that it's all working properly again.

2016-07-24 11:40:59: Tarsnap monitoring shows everything functioning normally.

2016-07-24 12:09:46: I send a quick email to the tarsnap-users list saying
that there was an outage and promising a full post-mortem to follow (this one,
although it took longer to write than the "later today" I initially expected).

Lessons learned
---------------

1. I am apparently able to sleep through my phone ringing, at least once.  I
had assumed this couldn't happen, so I set the Tarsnap monitoring system to
phone at most once per hour, since I knew I wouldn't want to be bothered by
repeated phone calls once I was already working on resolving an issue.

2. While external "behave like a real user" monitoring remains the gold
standard, better monitoring of internal metrics -- for example, the amount of
space free on a filesystem, which gradually diminished over the course of a
week -- could have alerted me to a problem before it caused an outage. [2]
(A sketch of such a check appears after this list.)

3. Anomalous behaviour should be investigated, even if it is believed to be
harmless.  Earlier in July I noticed that there had been an increase in the
fraction of S3 PUTs repeatedly failing; but since it did not seem to be
having any significant effect on overall system performance, and I knew that
all of the relevant code would retry failures, I did not investigate the root
cause.  While I don't know if I would have been able to uncover the reason
behind the anomalous behaviour from Amazon S3, the mere act of investigating
might have made me more alert to possible complications.

4. Post-mortems can take much longer to write than anticipated.  I hope I
never again complain that an Amazon Web Services post-mortem is taking too
long to arrive...
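
Regarding lesson 2: the kind of internal check I have in mind is roughly
the following (a minimal sketch; the 10% threshold is arbitrary, and
this is not monitoring the Tarsnap service currently runs).  Run
periodically, such a check would flag a filesystem whose free space was
gradually shrinking long before it hit zero -- subject, of course, to
the tradeoffs discussed in footnote [2].

    /*
     * Minimal sketch of a free-space check; the 10% threshold is
     * arbitrary and this is not monitoring the service currently runs.
     */
    #include <sys/statvfs.h>

    #include <stdio.h>
    #include <stdlib.h>

    int
    main(int argc, char *argv[])
    {
        struct statvfs sv;
        double pctfree;

        if (argc != 2) {
            fprintf(stderr, "usage: %s path\n", argv[0]);
            exit(1);
        }
        if (statvfs(argv[1], &sv) != 0) {
            perror("statvfs");
            exit(1);
        }

        /* Percentage of blocks available to unprivileged processes. */
        pctfree = 100.0 * (double)sv.f_bavail / (double)sv.f_blocks;
        printf("%s: %.1f%% free\n", argv[1], pctfree);

        /* Alert (here, just a warning and exit status) below 10%. */
        if (pctfree < 10.0) {
            fprintf(stderr, "WARNING: %s is under 10%% free\n", argv[1]);
            exit(2);
        }
        return (0);
    }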

Final words
-----------

While the ultimate precipitating factor seems to have been an as yet
undiagnosed change in Amazon S3 behaviour, the majority of the fault here
lies with Tarsnap: Inappropriate handling of repeated S3 failures resulted in
runaway disk space consumption; a lack of sufficient internal monitoring
allowed it to go undetected until it became critical; and an inadequate
understanding of the human factor -- that is, my ability to wake up when my
phone rang -- resulted in the outage lasting roughly three times as long as
it otherwise would have.  We can do better; and I will be specifically keeping
an eye out to ensure that this particular pattern of events does not recur.

Nevertheless, I'm happy that the particular failure mode -- "something weird
happened; shut down all the things" -- ran exactly as I hoped.  While the
Tarsnap service was unavailable, both for storing archives and for retrieving
them, for a period of 85 minutes, shutting down the service until I could
diagnose and fix the problem ensured that no user data was lost; I firmly
believe that for a data backup service, avoiding any risk of data loss must
always take priority over maintaining service availability.

Footnotes
---------

[1] The root cause of this odd behaviour within Amazon S3 is not yet
clear.  With some additional logging which I've added in the past few
days, it has become clear that there is an internal failure mode resulting
in InternalError responses being generated after approximately 4.88 seconds;
my observations thus far indicate that if three consecutive attempts yield
this same failure mode, there is a 100% chance that subsequent retries of the
same request will fail in the same way.  Since Amazon S3 documentation states
that failures should be retried and makes no mention of a possibility that
retries will *never* succeed, I presume that this is not intended behaviour.
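
One way to express the pattern described above in code (invented names,
and the latency window is approximate; this is purely illustrative, not
something the Tarsnap service runs):

    /*
     * Illustrative only: invented names, and the latency window is
     * approximate.  This is not code the Tarsnap service runs.
     */
    #include <stddef.h>
    #include <stdio.h>

    #define STUCK_THRESHOLD 3   /* Consecutive matching failures. */

    struct s3_attempt {
        int     failed;         /* Non-zero if the attempt failed. */
        int     internal_error; /* Non-zero for an InternalError. */
        double  latency;        /* Seconds taken by the attempt. */
    };

    /*
     * Return non-zero if the last STUCK_THRESHOLD attempts all failed
     * with an InternalError after roughly 4.88 seconds, i.e. if further
     * retries of this request appear unlikely to ever succeed.
     */
    static int
    request_looks_stuck(const struct s3_attempt *a, size_t n)
    {
        size_t i;

        if (n < STUCK_THRESHOLD)
            return (0);
        for (i = n - STUCK_THRESHOLD; i < n; i++) {
            if (!a[i].failed || !a[i].internal_error)
                return (0);
            if (a[i].latency < 4.5 || a[i].latency > 5.5)
                return (0);
        }
        return (1);
    }

    int
    main(void)
    {
        struct s3_attempt a[3] = {
            { 1, 1, 4.87 }, { 1, 1, 4.89 }, { 1, 1, 4.88 }
        };

        if (request_looks_stuck(a, 3))
            printf("retries of this request appear futile\n");
        return (0);
    }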

[2] Internal monitoring does however come with a risk: In order for it to be
useful, information must be conveyed out of the system being monitored.  This
implies the presence of additional code running on the system and a channel
for the information to be conveyed -- neither of which is desirable in a high
security environment.  I will have to carefully consider the tradeoffs here
and whether I can ensure adequate security before implementing any such new
monitoring mechanisms.

-- 
Colin Percival
Security Officer Emeritus, FreeBSD | The power to serve
Founder, Tarsnap | www.tarsnap.com | Online backups for the truly paranoid
