Actually, I just managed to poke at a hung qemu under a debugger for long enough to confirm what is happening here.
CMake's code for running child processes (in kwsys/ProcessUNIX.c) does this:
"On UNIX, a child process is forked to exec the program. Three output pipes are read by the parent process using a select call to block until data are ready. Two of the pipes are stdout and stderr for the child. The third is a special pipe populated by a signal handler to indicate that a child has terminated. This is used in conjunction with the timeout on the select call to implement a timeout for program even when it closes stdout and stderr and at the same time avoiding races."
So (assuming no timeout has been set up) we can get the following race:
* spawn child process
* parent gets to point of making select() syscall
* this takes the parent process into qemu's linux-user/main.c code
* child process exits
* host kernel sends SIGCHLD to parent
* qemu's signal handler queues this SIGCHLD and does a cpu_exit, which will make the parent take the signal at the next basic block
* parent code (still inside main.c or syscall.c) does the actual host select() syscall
* this blocks forever, because the thing that would wake it up is the signal handler writing to the pipe we're selecting on, but we will never run the signal handler until select exits
Fixing this bug will indeed require the significant rework I referred to in comment #14, I'm afraid. Don't hold your breath...