On Mon, Dec 30, 2019 at 06:48:16PM -0800, Graham Percival wrote: > On Sat, Dec 28, 2019 at 02:00:44PM +0200, Peter Pentchev wrote: > > It turns out that at least on Debian testing the pthread_cancel() function > > doesn't play very nice with threads that have already terminated - it > > returns ESRCH (no such thread/process). > > New idea: what if the thread *hasn't started* yet? > > In the "application usage" (informative section), the standard says: > There is no requirement on the implementation that the ID of the created > thread be available before the newly created thread starts executing. > https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_create.html > > So I'm imagining: > 1. pushbits() runs pthread_create() > 2. pushbits() finishes executing, and returns to main > 3. test code calls pthread_cancel() > 4. problem occurs > 5. ... if no problem had occurred, *then* the new thread would start > executing and the thread ID would become valid. > > I've sketched a solution in the > https://github.com/Tarsnap/spiped/tree/pushbits-wait-for-start > branch. It feels like overkill, but based on my interpretation of the > standard I think it's necessary [1]. > > [1] my original solution used semaphores and was much cleaner, but OSX doesn't > support unnamed semaphores (!!!). So I went with the clumsier approach of a > mutex, conditional variable, and int. Wow, did I ever drop the ball on that... Well, no, I didn't really drop the ball, it's just that I hadn't had any time for F/OSS work in the past month; sorry, I should have at least let you know :( TL;DR: nope, the problem is that with nothing on stdin (< /dev/null) the work thread in the echo_stdin_stdout test completes very, very quickly, and, yes, pthread_cancel() returns ESRCH on a finished thread, even though pthread_join() would succeed on it later. Anyway, I tested your "make sure the thread starts" patches and they didn't really help. I added a couple of debugging printfs (while fully realizing that it is possible for them to change the timing of stuff), and it seems that the problem is indeed that the work thread *finishes* executing before the main test thread invokes pthread_cancel(). You may see my current WIP tree (it does not actually build a Debian package due to the failing test) at: https://gitlab.com/spiped/spiped/-/tree/debian/experimental Here's the diagnostic output (produced by the test-report patch) of the test in question: 09-pushbits... test_pushbits: RDBG pushbits: starting test_pushbits: RDBG pushbits: about to invoke pthread_create() test_pushbits: RDBG pushbits: got workthread 0x7fe7b86c1700 test_pushbits: RDBG workthread: starting test_pushbits: RDBG workthread: done the lock thing test_pushbits: RDBG workthread_cleanup: starting test_pushbits: RDBG workthread_cleanup: done test_pushbits: RDBG pushbits: returning 0 test_pushbits: RDBG echo_stdin_stdout: starting test_pushbits: RDBG pushbits: starting test_pushbits: RDBG pushbits: about to invoke pthread_create() test_pushbits: RDBG pushbits: got workthread 0x7f759beb3700 test_pushbits: RDBG workthread: starting test_pushbits: RDBG workthread: done the lock thing test_pushbits: RDBG pushbits: returning 0test_pushbits: test_pushbits: RDBG echo_stdin_stdout: got thread 0x7f759beb3700RDBG workthread_cleanup: starting test_pushbits: RDBG workthread_cleanup: done test_pushbits: RDBG echo_stdin_stdout: about to invoke pthread_cancel for 0x7f759beb3700 test_pushbits: pthread_cancel: No such process test_pushbits: RDBG pushbits: starting test_pushbits: RDBG pushbits: about to invoke pthread_create() test_pushbits: RDBG pushbits: got workthread 0x7f5abbc88700 test_pushbits: RDBG workthread: starting test_pushbits: RDBG workthread: done the lock thing test_pushbits: RDBG pushbits: returning 0 test_pushbits: RDBG workthread_cleanup: starting test_pushbits: RDBG workthread_cleanup: done test_pushbits: RDBG pushbits: starting test_pushbits: RDBG pushbits: about to invoke pthread_create() test_pushbits: RDBG pushbits: got workthread 0x7f9abf41c700 test_pushbits: RDBG workthread: starting test_pushbits: RDBG workthread: done the lock thing test_pushbits: RDBG pushbits: returning 0 test_pushbits: RDBG pushbits: starting test_pushbits: RDBG pushbits: about to invoke pthread_create() test_pushbits: RDBG pushbits: got workthread 0x7f9abec1b700 test_pushbits: RDBG workthread: starting test_pushbits: RDBG workthread: done the lock thing test_pushbits: RDBG pushbits: returning 0 test_pushbits: RDBG workthread_cleanup: starting test_pushbits: RDBG workthread_cleanup: starting test_pushbits: RDBG workthread_cleanup: done test_pushbits: RDBG workthread_cleanup: done test_pushbits: RDBG echo_start_stop: invoking pushbits test_pushbits: RDBG pushbits: starting test_pushbits: RDBG pushbits: about to invoke pthread_create() test_pushbits: RDBG pushbits: got workthread 0x7fabf79ed700 test_pushbits: RDBG workthread: starting test_pushbits: RDBG workthread: done the lock thing test_pushbits: RDBG workthread_cleanup: starting test_pushbits: RDBG pushbits: returning 0 test_pushbits: RDBG echo_start_stop: about to invoke pthread_cancel for 0x7fabf79ed700 test_pushbits: RDBG workthread_cleanup: done test_pushbits: RDBG echo_start_stop: about to invoke pthread_join for 0x7fabf79ed700 test_pushbits: RDBG echo_start_stop: success FAILED! make[1]: *** [Makefile:83: test] Error 1 So it seems that the difference between the echo_stdin_stdout and echo_start_stop tests is that the latter takes a bit longer to finish; the work thread is still about to end when the main thread invokes pthread_cancel(), but it has not quite ended yet, so pthread_cancel() does not report an error. In echo_stdin_stdout though, the workthread_cleanup() function has finished executing before the main thread calls pthread_cancel() - and here's where the problem happens. Apparently glibc really, really doesn't believe in cancelling a thread that has already completed... I'm not really sure what to think about that, since, well, the thread hasn't been joined yet, so technically it still exists, albeit in a zombie-like state - and a little test program of mine shows that even after a thread has exited and pthread_cancel() returns ESRCH, a call to pthread_join() will succeed and will provide the correct value returned by the thread function. I'm not sure whether this is glibc or the Linux kernel or something else - the behavior is the same on three systems I've tested it on: - Debian testing, libc6-2.29 (amd64), kernel 5.4.0-3-amd64 - CentOS 6, glibc-2.12-1.212.el6_10.3.x86_64, kernel 2.6.32-504.16.2.el6.x86_64 - CentOS 7, glibc-2.17-292.el7.x86_64, kernel 3.10.0-1062.1.1.el7.x86_64 Actually, there was a bit of a "whaaat?" moment while testing it on the CentOS system: a plain "make test" run on the command line would actually succeed... that is, until it hit me that the main difference between running the test on the command line and running it via sbuild on my Debian system is that, well, sbuild does not provide anything on the build process's standard input. So "make test < /dev/null" quickly and reproducibly failed on the CentOS 6 system, too. Once I knew that, I tested both on CentOS 7 with the same result: "make test" succeeds, but "make test < /dev/null" fails every time, with or without my test-report warn0-debugging patch. So I could keep the check for ESRCH as a Debian-specific patch, or you could make it conditional on __linux__ or __gnu_linux__ or something similar. What do you think? Many apologies again for the long delay! G'luck, Peter -- Peter Pentchev roam@{ringlet.net,debian.org,FreeBSD.org} pp@storpool.com PGP key: http://people.FreeBSD.org/~roam/roam.key.asc Key fingerprint 2EE7 A7A5 17FC 124C F115 C354 651E EFB0 2527 DF13
Attachment:
signature.asc
Description: PGP signature