[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Potential spiped 1.6.1 release -- please test



On Mon, Dec 30, 2019 at 06:48:16PM -0800, Graham Percival wrote:
> On Sat, Dec 28, 2019 at 02:00:44PM +0200, Peter Pentchev wrote:
> > It turns out that at least on Debian testing the pthread_cancel() function
> > doesn't play very nice with threads that have already terminated - it
> > returns ESRCH (no such thread/process).
> 
> New idea: what if the thread *hasn't started* yet?
> 
> In the "application usage" (informative section), the standard says:
>     There is no requirement on the implementation that the ID of the created
>     thread be available before the newly created thread starts executing. 
>     https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_create.html
> 
> So I'm imagining:
>   1. pushbits() runs pthread_create()
>   2. pushbits() finishes executing, and returns to main
>   3. test code calls pthread_cancel()
>   4. problem occurs
>   5. ... if no problem had occurred, *then* the new thread would start
>      executing and the thread ID would become valid.
> 
> I've sketched a solution in the
> https://github.com/Tarsnap/spiped/tree/pushbits-wait-for-start
> branch.  It feels like overkill, but based on my interpretation of the
> standard I think it's necessary [1].
> 
> [1] my original solution used semaphores and was much cleaner, but OSX doesn't
> support unnamed semaphores (!!!).  So I went with the clumsier approach of a
> mutex, conditional variable, and int.

Wow, did I ever drop the ball on that... Well, no, I didn't really drop
the ball, it's just that I hadn't had any time for F/OSS work in the
past month; sorry, I should have at least let you know :(

TL;DR: nope, the problem is that with nothing on stdin (< /dev/null)
the work thread in the echo_stdin_stdout test completes very, very
quickly, and, yes, pthread_cancel() returns ESRCH on a finished thread,
even though pthread_join() would succeed on it later.

Anyway, I tested your "make sure the thread starts" patches and they
didn't really help. I added a couple of debugging printfs (while fully
realizing that it is possible for them to change the timing of stuff),
and it seems that the problem is indeed that the work thread *finishes*
executing before the main test thread invokes pthread_cancel().
You may see my current WIP tree (it does not actually build a Debian
package due to the failing test) at:

  https://gitlab.com/spiped/spiped/-/tree/debian/experimental

Here's the diagnostic output (produced by the test-report patch) of the
test in question:

  09-pushbits... test_pushbits: RDBG pushbits: starting
test_pushbits: RDBG pushbits: about to invoke pthread_create()
test_pushbits: RDBG pushbits: got workthread 0x7fe7b86c1700
test_pushbits: RDBG workthread: starting
test_pushbits: RDBG workthread: done the lock thing
test_pushbits: RDBG workthread_cleanup: starting
test_pushbits: RDBG workthread_cleanup: done
test_pushbits: RDBG pushbits: returning 0
test_pushbits: RDBG echo_stdin_stdout: starting
test_pushbits: RDBG pushbits: starting
test_pushbits: RDBG pushbits: about to invoke pthread_create()
test_pushbits: RDBG pushbits: got workthread 0x7f759beb3700
test_pushbits: RDBG workthread: starting
test_pushbits: RDBG workthread: done the lock thing
test_pushbits: RDBG pushbits: returning 0test_pushbits: 
test_pushbits: RDBG echo_stdin_stdout: got thread 0x7f759beb3700RDBG workthread_cleanup: starting

test_pushbits: RDBG workthread_cleanup: done
test_pushbits: RDBG echo_stdin_stdout: about to invoke pthread_cancel for 0x7f759beb3700
test_pushbits: pthread_cancel: No such process
test_pushbits: RDBG pushbits: starting
test_pushbits: RDBG pushbits: about to invoke pthread_create()
test_pushbits: RDBG pushbits: got workthread 0x7f5abbc88700
test_pushbits: RDBG workthread: starting
test_pushbits: RDBG workthread: done the lock thing
test_pushbits: RDBG pushbits: returning 0
test_pushbits: RDBG workthread_cleanup: starting
test_pushbits: RDBG workthread_cleanup: done
test_pushbits: RDBG pushbits: starting
test_pushbits: RDBG pushbits: about to invoke pthread_create()
test_pushbits: RDBG pushbits: got workthread 0x7f9abf41c700
test_pushbits: RDBG workthread: starting
test_pushbits: RDBG workthread: done the lock thing
test_pushbits: RDBG pushbits: returning 0
test_pushbits: RDBG pushbits: starting
test_pushbits: RDBG pushbits: about to invoke pthread_create()
test_pushbits: RDBG pushbits: got workthread 0x7f9abec1b700
test_pushbits: RDBG workthread: starting
test_pushbits: RDBG workthread: done the lock thing
test_pushbits: RDBG pushbits: returning 0
test_pushbits: RDBG workthread_cleanup: starting
test_pushbits: RDBG workthread_cleanup: starting
test_pushbits: RDBG workthread_cleanup: done
test_pushbits: RDBG workthread_cleanup: done
test_pushbits: RDBG echo_start_stop: invoking pushbits
test_pushbits: RDBG pushbits: starting
test_pushbits: RDBG pushbits: about to invoke pthread_create()
test_pushbits: RDBG pushbits: got workthread 0x7fabf79ed700
test_pushbits: RDBG workthread: starting
test_pushbits: RDBG workthread: done the lock thing
test_pushbits: RDBG workthread_cleanup: starting
test_pushbits: RDBG pushbits: returning 0
test_pushbits: RDBG echo_start_stop: about to invoke pthread_cancel for 0x7fabf79ed700
test_pushbits: RDBG workthread_cleanup: done
test_pushbits: RDBG echo_start_stop: about to invoke pthread_join for 0x7fabf79ed700
test_pushbits: RDBG echo_start_stop: success
FAILED!
make[1]: *** [Makefile:83: test] Error 1

So it seems that the difference between the echo_stdin_stdout and
echo_start_stop tests is that the latter takes a bit longer to finish;
the work thread is still about to end when the main thread invokes
pthread_cancel(), but it has not quite ended yet, so pthread_cancel()
does not report an error. In echo_stdin_stdout though, the
workthread_cleanup() function has finished executing before the main
thread calls pthread_cancel() - and here's where the problem happens.

Apparently glibc really, really doesn't believe in cancelling a thread
that has already completed... I'm not really sure what to think about
that, since, well, the thread hasn't been joined yet, so technically it
still exists, albeit in a zombie-like state - and a little test program
of mine shows that even after a thread has exited and pthread_cancel()
returns ESRCH, a call to pthread_join() will succeed and will provide
the correct value returned by the thread function. I'm not sure whether
this is glibc or the Linux kernel or something else - the behavior is
the same on three systems I've tested it on:
- Debian testing, libc6-2.29 (amd64), kernel 5.4.0-3-amd64
- CentOS 6, glibc-2.12-1.212.el6_10.3.x86_64, kernel 2.6.32-504.16.2.el6.x86_64
- CentOS 7, glibc-2.17-292.el7.x86_64, kernel 3.10.0-1062.1.1.el7.x86_64

Actually, there was a bit of a "whaaat?" moment while testing it on
the CentOS system: a plain "make test" run on the command line would
actually succeed... that is, until it hit me that the main difference
between running the test on the command line and running it via sbuild
on my Debian system is that, well, sbuild does not provide anything on
the build process's standard input. So "make test < /dev/null" quickly
and reproducibly failed on the CentOS 6 system, too. Once I knew that,
I tested both on CentOS 7 with the same result: "make test" succeeds,
but "make test < /dev/null" fails every time, with or without my
test-report warn0-debugging patch.

So I could keep the check for ESRCH as a Debian-specific patch, or you
could make it conditional on __linux__ or __gnu_linux__ or something
similar. What do you think?

Many apologies again for the long delay!

G'luck,
Peter

-- 
Peter Pentchev  roam@{ringlet.net,debian.org,FreeBSD.org} pp@storpool.com
PGP key:        http://people.FreeBSD.org/~roam/roam.key.asc
Key fingerprint 2EE7 A7A5 17FC 124C F115  C354 651E EFB0 2527 DF13

Attachment: signature.asc
Description: PGP signature