I am trying to build mono on FreeBSD-current (it is a dependency of some GNOME program). Unfortunately, the build does not work correctly.
The build hangs. If I stop it when it hangs and restart it, it gets a little further through the build steps, but then it hangs again.
If I ktrace the hanging process, I see a call to wait that returns with the error that the child does not exist (ECHILD), followed by a call to nanosleep.
It looks to me like this process missed a SIGCHLD (or is waiting for something which never existed), and a loop is waiting for a child to exit. This loop probably has no exit condition for the case that there is no such child (anymore), so it stays in the loop forever.
So I grepped around a little bit in mono and found the following code in <mono-src-dir>/mcs/class/Mono.Posix/Mono.Unix/UnixProcess.cs:
    public void WaitForExit ()
    {
        int status;
        int r;
        do {
            r = Native.Syscall.waitpid (pid, out status, (Native.WaitOptions) 0);
        } while (UnixMarshal.ShouldRetrySyscall (r));
        UnixMarshal.ThrowExceptionForLastErrorIf (r);
    }
This looks a little bit as if it could be related to the problem I see, but ShouldRetrySyscall only returns true if errno is EINTR. So this looks correct. 🙁
I looked a little bit more at this file, and either I do not understand the semantics of this language, or GetProcessStatus returns the return value of the waitpid call instead of the status (which, to my understanding, is not what it should return). If I am correct, it cannot really detect the status of a process. It would be very bad if such a fundamental thing went unnoticed in mono, and it would not cast a good light on the unit tests (if any) or the general testing of mono. For this reason I hope I am wrong.
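To make the distinction between the return value and the status concrete, here is a minimal C sketch. It is not mono code; the helper name wait_for_exit_code and the "128 + signal" convention are assumptions chosen for illustration. The return value of waitpid only says which child was reaped (or that the call failed), while the exit information has to be decoded from the status argument.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Hypothetical helper (not part of mono): block until `pid` terminates.
 * Returns the exit code, 128 + signal number if the child was killed by
 * a signal, or -1 on error (errno is left as set by waitpid).
 */
static int wait_for_exit_code(pid_t pid)
{
    int status;
    pid_t r;

    /* Retry only when the call itself was interrupted by a signal. */
    do {
        r = waitpid(pid, &status, 0);
    } while (r == -1 && errno == EINTR);

    if (r == -1)
        return -1;                     /* e.g. ECHILD: no such child */

    /* r is just the pid; the exit information lives in status. */
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);
    return -1;
}
```

Calling this a second time for the same pid fails with ECHILD, because a child can only be reaped once.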
I did not stop there, as this part does not look like it is the problem. I found the following in mono/io-layer/processes.c:
    static gboolean waitfor_pid (gpointer test, gpointer user_data)
    {
        ...
        do {
            ret = waitpid (process->id, &status, WNOHANG);
        } while (errno == EINTR);

        if (ret <= 0) {
            /* Process not ready for wait */
    #ifdef DEBUG
            g_message ("%s: Process %d not ready for waiting for: %s",
                       __func__, process->id, g_strerror (errno));
    #endif
            return (FALSE);
        }

    #ifdef DEBUG
        g_message ("%s: Process %d finished", __func__, ret);
    #endif

        process->waited = TRUE;
        ...
    }
And here we have the problem, I think. I changed the (ret <= 0) to (ret == 0 || (ret < 0 && errno != ECHILD)). This will not really give the correct status, but at least it should not block anymore and I should be able to see the difference during the build.
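The idea behind that change can be sketched as a standalone C helper. This is not the mono code; the enum and the name poll_child are hypothetical. It separates the three outcomes that (ret <= 0) lumps together: the child is still running, the child is gone (ECHILD), and some other error. It also checks ret before looking at errno, since errno is only meaningful after a failed call.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical result type, introduced only for this sketch. */
enum child_state {
    CHILD_RUNNING,  /* waitpid returned 0: nothing to reap yet        */
    CHILD_EXITED,   /* child was reaped, *status is valid             */
    CHILD_GONE,     /* ECHILD: no such child (already reaped or lost) */
    CHILD_ERROR     /* any other waitpid failure                      */
};

/* Non-blocking check on a child; `pid` must be a specific pid (> 0). */
static enum child_state poll_child(pid_t pid, int *status)
{
    pid_t ret;

    /* Only retry if the call failed AND the reason was EINTR. */
    do {
        ret = waitpid(pid, status, WNOHANG);
    } while (ret == -1 && errno == EINTR);

    if (ret == 0)
        return CHILD_RUNNING;
    if (ret == pid)
        return CHILD_EXITED;
    if (errno == ECHILD)
        return CHILD_GONE;   /* a caller looping on this should stop */
    return CHILD_ERROR;
}
```

A caller that polls in a loop can then treat CHILD_GONE as a terminal state instead of sleeping and retrying forever, although, as noted above, no real exit status is available in that case.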
And now after testing, I see a difference, but the problem is still there. The wait with ECHILD is gone in the loop, but there is still some loop with a semaphore operation:
62960 mono CALL clock_gettime(0xd,0xbf9feef8)
62960 mono RET clock_gettime 0
62960 mono CALL semop(0x20c0000,0xbf9feef6,0x1)
62960 mono RET semop 0
62960 mono CALL semop(0x20c0000,0xbf9feef6,0x1)
62960 mono RET semop 0
62960 mono CALL semop(0x20c0000,0xbf9feef6,0x1)
62960 mono RET semop 0
62960 mono CALL semop(0x20c0000,0xbf9feef6,0x1)
62960 mono RET semop 0
62960 mono CALL nanosleep(0xbf9fef84,0)
62960 mono RET nanosleep 0
62960 mono CALL clock_gettime(0xd,0xbf9feef8)
62960 mono RET clock_gettime 0
62960 mono CALL semop(0x20c0000,0xbf9feef6,0x1)
62960 mono RET semop 0
62960 mono CALL semop(0x20c0000,0xbf9feef6,0x1)
62960 mono RET semop 0
62960 mono CALL semop(0x20c0000,0xbf9feef6,0x1)
62960 mono RET semop 0
62960 mono CALL semop(0x20c0000,0xbf9feef6,0x1)
62960 mono RET semop 0
62960 mono CALL nanosleep(0xbf9fef84,0)
OK, there is more going on. I think someone with more knowledge about mono should have a look at this (and not only at this semop thing, but also at why it loses a child).