
Fwd: Tarsnap performance issues in late March / most of April




In case anyone is subscribed here but not to tarsnap-announce:

-------- Forwarded Message --------
Subject: Tarsnap performance issues in late March / most of April
Date: Tue, 05 May 2015 15:26:37 -0700
From: Colin Percival <cperciva@tarsnap.com>
To: tarsnap-announce@tarsnap.com

Hi everyone,

As several people have noticed, the performance of the Tarsnap service towards
the end of March and through most of April was lower than normal; this caused
backups to take longer to complete, and in some cases to fail entirely.

Unfortunately these problems arose from a combination of factors, which,
together with my unwillingness to make changes to the service without first
performing enough testing to be certain that the changes would be safe,
considerably extended the time required to return to normal performance.
Had I realized that it would take so long, I would have sent an email to
the mailing lists earlier; but since at each point I thought I was "one change
away" from fixing the problems, I kept delaying that email until it was
clear that the problems were finally fixed.

The good news is that for the past ten days, performance has been at or
slightly better than previous normal levels, at least to the limits of my
monitoring.
Obviously, if anyone is still experiencing diminished Tarsnap performance, I
would like to hear about it.

The rest of this email discusses the causes of the performance degradation
and the steps I took to resolve it; while none of this is directly relevant
to Tarsnap users, some may find it interesting and/or informative.

Technical details
-----------------

The largest single factor in the performance problems was what is described
by the wonderful euphemism "success failure": Over the years, tarsnap has
gained users, and the amount of data being pushed through the service has
increased as a result.  While I had tested the performance of the service,
simulated traffic never quite matches reality, and it turned out that my
testing was deficient in some areas.  Given the performance impact
of increasing traffic, on a few occasions during April I came close to
turning off new account creation (new accounts tend to send more traffic than
established accounts, since tarsnap's deduplication means that a first backup
typically involves uploading far more data than a second or subsequent
backup), but I ultimately decided that this was unnecessary.
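
As a toy illustration of why first backups dominate upload traffic, the
self-contained C sketch below uploads only blocks whose hash has not been
seen before.  It is hypothetical and much simplified -- it is not the Tarsnap
client's actual deduplication code, which works on variable-length encrypted
blocks -- but it shows why an identical second backup uploads almost nothing.

    /*
     * Toy illustration of deduplication's effect on upload volume;
     * hypothetical, not the Tarsnap client's actual block/hash logic.
     * The first backup uploads every block; an identical second backup
     * uploads nothing, because every block hash has been seen before.
     */
    #include <stdio.h>

    #define NBLOCKS 1000

    static unsigned int stored[NBLOCKS];    /* hashes of blocks already stored */
    static int nstored = 0;

    /* Stand-in for a real cryptographic block hash. */
    static unsigned int
    block_hash(int blocknum)
    {
        return (unsigned int)blocknum * 2654435761u;
    }

    static int
    already_stored(unsigned int hash)
    {
        int i;

        for (i = 0; i < nstored; i++)
            if (stored[i] == hash)
                return 1;
        return 0;
    }

    /* "Back up" NBLOCKS blocks, uploading only blocks not seen before. */
    static int
    run_backup(void)
    {
        int uploaded = 0;
        int i;

        for (i = 0; i < NBLOCKS; i++) {
            unsigned int h = block_hash(i);

            if (!already_stored(h)) {
                stored[nstored++] = h;      /* "upload" the block */
                uploaded++;
            }
        }
        return uploaded;
    }

    int
    main(void)
    {
        printf("first backup:  %d blocks uploaded\n", run_backup());
        printf("second backup: %d blocks uploaded\n", run_backup());
        return 0;
    }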

The first hint of problems showed up on March 3rd, in the form of the service
intermittently hitting its connection limit and rejecting incoming traffic.
My first thought was that this was caused by bursts of connections arriving
from NTP-synchronized servers at "round number" times -- a phenomenon I've
noticed before, with the rate of arriving connections increasing by a factor
of 50 within the five seconds following :00:00, :15:00, and :30:00 in each
hour -- but it turned out that during the busy hours between 00:00 and 10:00
UTC (i.e., the middle of the night for Europe and North America) the problem
was being exacerbated by a backlog of "internal" connections.

While the tarsnap service spends most of its time shuffling bits back and
forth between the client code and Amazon S3, there are some lower-priority
"background" tasks it also performs for managing data; one of these is to
print statistics about its internal state.  When I first wrote this code, I
made a design decision which came back to haunt me later: Because handling
these background tasks was not at all urgent, I placed them on a secondary
queue, to only be handled when there was nothing more important to do.  So
far so good.  Unfortunately, until these tasks were handled, the connection
on which they arrived would not be dropped... and to make things worse, the
code which sent these background-task requests used the same retry logic as
the standard tarsnap client-server protocol, namely that if a request is not
handled within 30 seconds it will time out -- and reissue the request.
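
As a toy illustration of why that retry logic was dangerous here, the
following self-contained C sketch (hypothetical, not Tarsnap's protocol code)
simulates a request which the server never gets around to answering: every 30
seconds the client times out and reissues it, so copies of the request -- and
the connections carrying them -- pile up on the server.

    /*
     * Toy simulation of the timeout-and-reissue behaviour described
     * above; hypothetical, not Tarsnap's actual protocol code.
     */
    #include <stdio.h>

    #define REQUEST_TIMEOUT 30  /* seconds before an unanswered request is reissued */

    int
    main(void)
    {
        int outstanding = 0;    /* copies of the request held by the server */
        int t;

        /* Simulate ten minutes during which the server never services
         * the low-priority task (because higher-priority work always
         * exists), so every reissue just adds another queued copy. */
        for (t = 0; t < 600; t++) {
            if (t % REQUEST_TIMEOUT == 0) {
                outstanding++;
                printf("t = %3d s: %d copies of the request queued\n",
                    t, outstanding);
            }
        }

        return 0;
    }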

It took me a few days to figure out what was going on here, but once I did,
the solution was straightforward: Rather than always eschewing low-priority
tasks in favour of normal traffic, I introduced a 1% chance of picking a
low-priority task even when normal traffic was waiting.  This 1% chance had
no adverse effect on performance, but allowed the background tasks to be
handled reasonably promptly and prevented any backlog of internal connections
from building up.  This solved the problem of refused connections, and things
returned to normal for a while.
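
As a minimal sketch of that scheduling change -- the structure here is
hypothetical and is not taken from the Tarsnap server code -- the scheduler
still strongly prefers the primary queue, but roughly 1% of the time it
services a background task even though normal traffic is waiting:

    /*
     * Minimal self-contained sketch of the scheduling change described
     * above; hypothetical structure, not the Tarsnap server code.
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define PRIMARY 0
    #define BACKGROUND 1

    /* Decide which queue to service next. */
    static int
    pick_queue(int primary_len, int background_len)
    {
        /* Old behaviour: only touch the background queue when the
         * primary queue is empty -- which, under sustained load,
         * means never.  The 1% random pick breaks that starvation. */
        if (background_len > 0 &&
            (primary_len == 0 || (rand() % 100) == 0))
            return BACKGROUND;
        return PRIMARY;
    }

    int
    main(void)
    {
        int background_served = 0;
        int i;

        /* Simulate a million scheduling decisions during which primary
         * work is always available and a background task is waiting. */
        for (i = 0; i < 1000000; i++)
            if (pick_queue(1, 1) == BACKGROUND)
                background_served++;

        /* Roughly 1% of the picks should go to the background queue. */
        printf("background tasks served: %d of 1000000\n", background_served);

        return 0;
    }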

In the last week of March I started to see diminished performance during the
busy nighttime hours.  I think this was caused by an increased volume of data
being read out of tarsnap -- reads involve more I/O than writes, because the
block metadata needed to retrieve a block must be read, whereas for writing a
block all that is required is to confirm that the block has not been stored
previously -- but I was not able to absolutely confirm that this was the
cause, because...

At 2015-04-01 00:00 UTC, the Amazon EC2 "provisioned I/O" volume on which most
of this metadata was stored suddenly changed from an average latency of 1.2
ms per request to an average latency of 2.2 ms per request.  I have no idea
why this happened -- indeed, I was so surprised by it that I didn't believe
Amazon's monitoring systems at first -- but this immediately resulted in the
service being I/O limited.

Over the following weeks, I took several steps to improve performance and
reduce the time spent waiting for I/O:

1. On April 5th, I disabled some background tasks, including the deletion
of "expired" Tarsnap accounts and the aforementioned tasks which had been
clogging up the service connection pool; this did not significantly increase
throughput, but it provided a modest reduction in system load, thus allowing
a slight increase in *user-visible* throughput.

2. On April 9th, I deployed new server code to make use of the Intel AES-NI
instructions; while this did nothing to reduce I/O, it made it possible to
push more bits through during the periods when the service was not waiting
on I/O.  (A brief sketch of the AES-NI feature check appears after this list.)

3. On April 14th, I added a new index to a database used by read operations;
I had not previously indexed this metadata in this manner because the volume
of reads had historically not been enough to justify the extra indexing work,
but the recently increased volume of read traffic made it worthwhile.

4. On April 18th, I created a new EBS volume and moved some of the Tarsnap
storage metadata (specifically, the lists of blocks stored by each machine)
across to it.  This yielded two improvements: First, splitting the metadata
between two disks allowed more I/O throughput than one disk could provide;
and second, the differing I/O patterns -- the metadata I moved saw bursts of
heavy I/O as backup transactions were committed, while the metadata used by
reads (which remained on the old virtual disk) saw a more constant stream of
latency-sensitive requests -- meant that separating the two had indirect
performance benefits as well.
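
Regarding item 2 above: I don't know exactly how the Tarsnap server switches
between AES implementations, but as a brief sketch of the feature check
involved, the AES-NI instructions are advertised via CPUID leaf 1, ECX bit 25,
which can be tested from C (with GCC or Clang on x86) as follows.  This only
illustrates the detection step; it is not the Tarsnap server's actual code.

    /*
     * Minimal sketch of detecting the Intel AES-NI instructions at run
     * time; illustrates the feature check only, and is not the Tarsnap
     * server's actual code.  Requires GCC or Clang on x86.
     */
    #include <stdio.h>

    #include <cpuid.h>

    int
    main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* CPUID leaf 1 reports feature flags; AES-NI is ECX bit 25. */
        if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 25)))
            printf("AES-NI available: AES rounds can run in hardware\n");
        else
            printf("AES-NI not available: software AES only\n");

        return 0;
    }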

At this point I/O was no longer a critical performance-limiting factor, but
performance was still suboptimal due to a problem I had noticed two months
earlier and had been writing -- and testing -- code to fix ever since: There
was a form of priority inversion going on.  Normally the Tarsnap server code
gives priority to performing network I/O, since that is more latency-sensitive
than other operations; but the task of kicking off writes into Amazon S3 was
triggered by a lower-priority timer.  As a result, during periods of heavy
load, requests would be read from the network... but could not be completed,
since completing a request required logging data into S3 for durability.  This did
not cause a complete deadlock because each incoming connection was limited in
how many requests it could send -- at a certain point the server would stop
listening for incoming requests until it had worked down its queue -- but it
was nonetheless slowing down request throughput during times when there was a
large volume of incoming traffic.

The eventual solution I applied was to suspend reading requests from a
connection each time a request had been read, and to resume reading only when
a zero-second timer "expired".  This ensured that the "kick off writes into
S3" timer could not be starved out, and had the side effect of increasing
"fairness" between connections: if requests were arriving from multiple
connections, each connection would get one request read before any of them
was re-enabled for reading its next request.  I deployed this change -- after
considerable testing, since it was not a trivial change -- on April 24th, and
shortly after that final change the Tarsnap service's performance returned to
normal levels.
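
I don't know the internals of the Tarsnap server's event handling, so the
fragment below only sketches the general pattern using libevent as a stand-in;
the structure and names are mine, not Tarsnap's.  After one request is read
from a connection, its read event is removed and a zero-second timer is armed;
by the time that timer fires, other pending events (such as the timer which
kicks off writes into S3, or reads on other busy connections) have had their
turn, and reading resumes.

    /*
     * Sketch of the "read one request, resume via a zero-second timer"
     * pattern described above, using libevent as a stand-in for the
     * server's own event loop; hypothetical structure and names.
     */
    #include <sys/time.h>

    #include <event2/event.h>

    struct conn {
        struct event *ev_read;      /* fires when the socket is readable */
        struct event *ev_resume;    /* zero-second timer to resume reading */
    };

    /* Assumed to exist elsewhere: read and queue exactly one request. */
    void readreq(evutil_socket_t s);

    static void
    resume_cb(evutil_socket_t s, short what, void *arg)
    {
        struct conn *c = arg;

        (void)s; (void)what;

        /* The zero-second timer expired; start reading again. */
        event_add(c->ev_read, NULL);
    }

    static void
    read_cb(evutil_socket_t s, short what, void *arg)
    {
        struct conn *c = arg;
        struct timeval zero = { 0, 0 };

        (void)what;

        /* Read exactly one request from this connection... */
        readreq(s);

        /* ... then stop reading and arm a zero-second timer; other
         * pending events (e.g. the S3 write kickoff timer, or reads on
         * other busy connections) get to run before this connection's
         * next request is read. */
        event_del(c->ev_read);
        evtimer_add(c->ev_resume, &zero);
    }

    /* Register the two events for a new connection. */
    void
    conn_setup(struct event_base *base, struct conn *c, evutil_socket_t s)
    {
        c->ev_read = event_new(base, s, EV_READ | EV_PERSIST, read_cb, c);
        c->ev_resume = evtimer_new(base, resume_cb, c);
        event_add(c->ev_read, NULL);
    }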

Lessons learned
---------------

1. Increasing traffic makes all bottlenecks worse, including the ones which
were previously identified as "not worth worrying about".

2. Sometimes the performance of the "cloud" suddenly drops, for no apparent
reason at all.

3. Performing well in tests doesn't always mean that you'll hit the same
benchmarks when handling real-world traffic, because tests rarely replicate
the exact pattern of incoming requests.

4. Measures which improve performance in the average case may have the side
effect of masking a problem until the worst case arrives.

5. When performance drops, it's not always due to a single problem; sometimes
there are multiple interacting bottlenecks.
