Tarsnap outage 2011-07-23 08:00 -- 15:30 UTC

Hi all,

As many of you have noticed, there was a Tarsnap outage between 08:00 and 15:30
UTC today.  *Your data is safe* -- the outage only affected the Tarsnap front
end server.  This outage was caused by an unfortunate collision between bad
luck and stupidity:
1. The Linux kernel said "disk caches are yummy" and decided to use all of the
available RAM for disk cache.
2. The Linux kernel then said "can haz more RAM for disk cache?" and launched
the out-of-memory killer to find some more memory.
3. The out-of-memory killer said "hey, there's only one process running which is
using a significant amount of RAM" and killed the Tarsnap server process.
4. I went to sleep about 30 minutes before this started.
5. Having played in a chamber music concert last night, I had turned my phone
off, and I stupidly forgot to turn it back on.

The Tarsnap server is now running again, and hopefully won't be oomkilled again
in the near future.  In the next few days -- once I've finished testing, that is
-- I will be moving the Tarsnap server to a heftier EC2 instance, which should
prevent this as well as remedying other load-related issues I've seen recently;
this might cause another short outage (probably < 10 minutes), and I will send
out another email before the move occurs.

I will also be applying a credit equal to 8 days of Tarsnap storage to all
Tarsnap accounts; while Tarsnap doesn't have any formal SLA, a credit of 1 day
per hour of outage seems like the Right Thing To Do.
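
For anyone checking the arithmetic, the 8-day figure can be reproduced with a
short sketch.  This is only an illustration: the function name and the choice
to round a partial hour up to a whole day are assumptions made to match the
numbers above, not a description of how Tarsnap's accounting actually works.

```python
import math
from datetime import datetime


def outage_credit_days(start, end, days_per_hour=1):
    """Return the credit in whole days for an outage from start to end.

    Assumption: a fractional result is rounded up, so a 7.5-hour outage
    at 1 day per hour yields 8 days of credit.
    """
    hours = (end - start).total_seconds() / 3600
    return math.ceil(hours * days_per_hour)


# Outage window from this email: 08:00 to 15:30 UTC on 2011-07-23.
start = datetime(2011, 7, 23, 8, 0)
end = datetime(2011, 7, 23, 15, 30)
credit = outage_credit_days(start, end)
print(credit)  # 8
```

The outage lasted 7.5 hours, so at 1 day per hour, rounded up, the credit
comes to 8 days.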

Colin Percival
Security Officer, FreeBSD | freebsd.org | The power to serve
Founder / author, Tarsnap | tarsnap.com | Online backups for the truly paranoid