
spiped max qlen being hit



hi spipers,

I'm in the middle of a migration from Debian to FreeBSD, including
replacing autossh with spiped.

I've been using spiped for about six months now with great success; it's
proved more stable and resilient to transient network failures than
autossh. I've been using it mainly for monitoring events and security
log shipping - long-term, stable connections.

However, as I switched the main database servers over, connections
started to fail, and in the end I had to switch back to autossh. It's
been stable since then, but I'd prefer to use the right tool for the job
here, to avoid the stability issues we had in the past.

I've got a gist with more details and readable markdown [1], but the
guts are here:

Once I switch more and more hosts over to the spiped tunnel to the DB, I
start seeing listen queue overflows on the spiped server process, then
shortly afterwards RST rate limiting, and finally the pf states table
blowing out.

Log messages:

<7>Dec 16 16:35:42 0 2016-12-16T16:35:40.500908+00:00 beatrix kernel - - [72288] sonewconn: pcb 0xfffff80d68996cb0: Listen queue overflow: 16 already in queue awaiting acceptance (64492 occurrences)

<5>Dec 16 16:35:42 0 2016-12-16T16:35:42.555992+00:00 beatrix kernel - - [72290] Limiting open port RST response from 11938 to 200 packets/sec

<5>Dec 16 16:35:43 0 2016-12-16T16:35:43.561039+00:00 beatrix kernel - - [72291] Limiting open port RST response from 26544 to 200 packets/sec

<6>Dec 16 16:35:44 0 2016-12-16T16:35:44.170158+00:00 beatrix kernel - - [72291] [zone: pf states] PF states limit reached

There are a few odd things - normally you'd look up the pcb using
`netstat -ALan` and see which process is responsible. Nothing with that
value showed up [2], nor under fstat either. The same pcb would come up
for several minutes/hours, so it's very odd that I didn't see it in
netstat.

Other than a lot of TIME_WAIT states, I saw these for spiped:

root     spiped     61497 text /         94608 -r-xr-xr-x   69608  r
root     spiped     61497   wd /             4 drwxr-xr-x      26  r
root     spiped     61497 root /             4 drwxr-xr-x      26  r
root     spiped     61497    0 -         -         bad    -
root     spiped     61497    1 -         -         bad    -
root     spiped     61497    2 -         -         bad    -
root     spiped     61497    3* internet stream tcp fffff8012c82d820
root     spiped     61497    4* local stream fffff8003355f0f0 <-> fffff8003355f000
root     spiped     61497    5* local stream fffff8003355f000 <-> fffff8003355f0f0
root     spiped     24819 text /         94608 -r-xr-xr-x   69608  r
root     spiped     24819   wd /             4 drwxr-xr-x      26  r
root     spiped     24819 root /             4 drwxr-xr-x      26  r
root     spiped     24819    0 -         -         bad    -
root     spiped     24819    1 -         -         bad    -
root     spiped     24819    2 -         -         bad    -
root     spiped     24819    3* internet stream tcp fffff8012c9a0820
root     spiped     24819    4* local stream fffff800335e4e10 <-> fffff800335e4d20

It's never comforting to see "bad" in your terminal....

A few days later I finally got a pcb match, and it does come from
spiped's listening port:

fffff80cca843000 tcp4  16/0/10             *.15984

I also beefed up the PF states table by a factor of 10 [4], and tweaked
some sysctls [5], which did seem to help, but not enough to be stable.
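
As a side note, I'm assuming kern.ipc.somaxconn is one of the relevant
knobs here - it's the kernel's hard cap on any listen() backlog, so
raising a backlog past it has no effect. A trivial way to read it from C
(equivalent to running `sysctl kern.ipc.somaxconn`), just as a sketch:

    /* Read the kernel's cap on listen() backlogs (FreeBSD). */
    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <stdio.h>

    int
    main(void)
    {
        int maxconn;
        size_t len = sizeof(maxconn);

        /* kern.ipc.somaxconn clamps every listen(2) backlog. */
        if (sysctlbyname("kern.ipc.somaxconn", &maxconn, &len, NULL, 0) == -1) {
            perror("sysctlbyname");
            return (1);
        }
        printf("kern.ipc.somaxconn = %d\n", maxconn);
        return (0);
    }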

I've got the following info from bridget (the primary DB server) [6] and
beatrix (the backup one) [7] during the period when I was provoking the
error using wrk [8]:

netstat -ALan
netstat -i
netstat -s
pfctl -sall

Some questions:

1. Am I using spiped incorrectly by having it handle many short-lived
connections? There are usually about 100, sometimes up to around 200,
concurrent HTTP connections, spread across 6 main "client" servers that
all connect to the same tunnel endpoint.

2. Is it possible/wise to increase the accept queue length for spiped
somehow? I assume this is an initial socket setup parameter, set here:
https://github.com/Tarsnap/spiped/blob/master/libcperciva/util/sock.c#L329
The backend DB is more than capable of handling the load, and accepts
socket parameter tuning directly. (See the first sketch after these
questions for what I mean.)

3. During testing, I couldn't work out how spiped chooses which DNS
record to use when multiple are returned - specifically AAAA vs A
records. Is it possible to force spiped to use either IPv4 or IPv6 (via
a command-line flag, say) instead of whatever comes back from the
hostname lookup? (See the second sketch below.)

4. Are there any other things I could do on the FreeBSD side that might
help?
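
On question 2: I'm assuming the accept backlog is the third argument to
listen() at the sock.c line above; if it's a small fixed value (10-ish),
that would line up with the "16 already in queue" message, since FreeBSD
seems to start logging overflows once the queue exceeds roughly 1.5x the
backlog. Purely to illustrate what I'm asking about - this is not
spiped's actual code, and make_listener() is a made-up name - a listener
with a larger backlog would look something like:

    /* Illustration only: a TCP listener with a larger accept backlog.
     * The effective queue is still capped by kern.ipc.somaxconn. */
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <err.h>
    #include <stdint.h>
    #include <string.h>

    int
    make_listener(uint16_t port, int backlog)
    {
        struct sockaddr_in sin;
        int s;

        if ((s = socket(AF_INET, SOCK_STREAM, 0)) == -1)
            err(1, "socket");
        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_addr.s_addr = htonl(INADDR_ANY);
        sin.sin_port = htons(port);
        if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1)
            err(1, "bind");
        if (listen(s, backlog) == -1)   /* e.g. 128 instead of 10 */
            err(1, "listen");
        return (s);
    }

e.g. make_listener(15984, 128) for the port in the netstat output above.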
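
On question 3: I don't know how spiped itself picks between returned
records, but for reference, the usual way to force one address family at
resolution time is the ai_family hint to getaddrinfo(). A minimal sketch
(structure and names are just for illustration):

    /* Sketch: resolve a hostname but only keep IPv4 (or IPv6) results. */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netdb.h>
    #include <stdio.h>
    #include <string.h>

    int
    main(int argc, char *argv[])
    {
        struct addrinfo hints, *res, *p;
        char buf[NI_MAXHOST];

        if (argc != 2) {
            fprintf(stderr, "usage: %s hostname\n", argv[0]);
            return (1);
        }
        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_INET;    /* AF_INET6 forces IPv6, AF_UNSPEC allows both */
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo(argv[1], NULL, &hints, &res) != 0)
            return (1);
        for (p = res; p != NULL; p = p->ai_next)
            if (getnameinfo(p->ai_addr, p->ai_addrlen, buf, sizeof(buf),
                NULL, 0, NI_NUMERICHOST) == 0)
                printf("%s\n", buf);
        freeaddrinfo(res);
        return (0);
    }

Something like that exposed as a flag on spiped would cover my use case.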

BTW, obviously the root cause of this issue is that the applications
that connect to the DB don't implement a circuit breaker pattern, so in
the event of a prolonged loss of connectivity they hammer spiped and
FreeBSD furiously with continuous retries and soak up all the available
ports. I'll deal with them later :D
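
For the record, even a crude exponential backoff around connect() would
take most of the sting out of that - a rough sketch, nothing more
(connect_backoff() is a made-up helper):

    /* Sketch: retry connect() with exponential backoff instead of hammering. */
    #include <sys/socket.h>
    #include <unistd.h>

    int
    connect_backoff(const struct sockaddr *sa, socklen_t salen)
    {
        unsigned int delay = 1;    /* seconds */
        int s;

        for (;;) {
            if ((s = socket(sa->sa_family, SOCK_STREAM, 0)) != -1) {
                if (connect(s, sa, salen) == 0)
                    return (s);    /* connected socket */
                close(s);          /* fresh socket per attempt */
            }
            sleep(delay);
            if (delay < 60)
                delay *= 2;        /* 1, 2, 4 ... capped at 64s */
        }
    }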

A+
Dave

[1]: https://git.io/v19u0
[2]: https://git.io/v1NER
[4]: https://git.io/v1N0d
[5]: https://git.io/v1NET
[6]: https://git.io/v1Nul
[7]: https://git.io/v1NuE
[8]: https://github.com/wg/wrk