New DTrace probes for the linuxulator

I forward-ported my DTrace probes for the FreeBSD linuxulator from a 2008 -current to a recent -current. They do not cover the complete linuxulator yet, but a big part is already done. I can check the major locks in the linuxulator, trace futexes, and I have a D-script which yells at a lot of errors which could happen but should not.

Some of my D-scripts need changes, as real-world testing showed that they do not really work as expected. They can get overwhelmed by the amount of speculation and dynamic variables (error message: "dynamic variable drops with non-empty dirty list"). For the dynamic variables problem I found a discussion on the net with some suggestions. For the speculation part I expect similar tuning possibilities.

Unfortunately the D-script which checks the internal locks fails to compile. It seems there is a little misunderstanding on my side about how the D language is supposed to work.

I will try to find some time later to have a look at those problems.

During my development I stumbled over some generic DTrace problems with the SDT provider I use for my probes:

  • If you load the Linux module after the SDT module, your system will panic as soon as you access some probes, e.g. "dtrace -l" will panic the system. Loading the Linux module before the SDT module prevents the panic.
  • Unloading the SDT module while the Linux module with the SDT probes is still loaded panics the system too. Do not unload the SDT module if you run with my patch.

According to avg@ those are known problems, but I think nobody is working on them. This is bad, because it means I cannot commit my current patchset.

If someone wants to try the new DTrace probes for the linuxulator, feel free to go to http://www.Leidinger.net/FreeBSD/current-patches/ and download linuxulator-dtrace.diff. I do not offer a working hyperlink here on purpose: the SDT bugs can hurt if you are not careful, and I want to make the use of this patch a strong opt-in because of this. If the patch hurts you, it is your fault; you have been warned.


Enabled some rewrite rules, email me in case of problems

I enabled some rewrite rules which deny access to my site if a comment is posted in WordPress with a referrer which is not from my webserver, or if any content is accessed with an unwanted user agent.

Legitimate access is not supposed to be blocked, so if you notice a block, just send me an email and tell me exactly what you did, so that I can investigate how to prevent the block: name of the program, version, user agent string, how you accessed my site, which URL generated the error message, and the exact time (you better synchronize with NTP to make sure that your time is the same as mine) with the timezone in UTC[+-]XXXX notation.


Mono build problems on FreeBSD-current

I am trying to build mono on FreeBSD-current (it is a dependency of some GNOME program). Unfortunately this does not work correctly.

What I see are hangs of the build. If I stop the build when it hangs and restart it, it continues and gets a little bit further through the build steps, but then it hangs again.

If I ktrace the hanging process, I see a call to wait which returns with the error that the child does not exist (ECHILD). Then there is a call to nanosleep.

It looks to me like this process missed a SIGCHLD (or is waiting for something which did not exist at all), and a loop is waiting for a child to exit. This loop probably has no proper condition for the case that there is no such child (anymore). As such it will stay in this loop forever.

So I grepped a little bit around in mono and found the following code in <mono-src-dir>/mcs/class/Mono.Posix/Mono.Unix/UnixProcess.cs:

public void WaitForExit ()
{
    int status;
    int r;
    do {
        r = Native.Syscall.waitpid (pid, out status, (Native.WaitOptions) 0);
    } while (UnixMarshal.ShouldRetrySyscall (r));
    UnixMarshal.ThrowExceptionForLastErrorIf (r);
}

This looks a little bit like it could be related to the problem I see, but ShouldRetrySyscall only returns true if the errno is EINTR. So this looks correct. 🙁

I looked a little bit more at this file, and it looks like either I do not understand the semantics of this language, or GetProcessStatus returns the return value of the waitpid call instead of the status (which is not what it should return, to my understanding). If I am correct, it cannot really detect the status of a process. It would be very bad if such a fundamental thing went unnoticed in mono... which would not put a good light on the unit tests (if any) or the general testing of mono. For this reason I hope I am wrong.
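For comparison, this is how the two values differ at the C level. It is only a sketch and not the mono code; the helper name child_exit_code and its parameters are made up for this illustration. waitpid returns the pid of the child, while the exit code has to be decoded from the status word it fills in.

```c
#define _POSIX_C_SOURCE 200809L  /* must come before the includes */

#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child which exits with `code`, wait for it
 * and return the exit code decoded from the status word (-1 on failure).
 * *child gets the pid from fork(), *wait_ret the raw waitpid() result.
 */
int child_exit_code(int code, pid_t *child, pid_t *wait_ret)
{
    int status = 0;
    pid_t r;
    pid_t pid;

    assert(child != NULL && wait_ret != NULL);

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);            /* child: exit right away */

    /* Blocking wait; retry only if a signal interrupted the call. */
    do {
        r = waitpid(pid, &status, 0);
    } while (r == -1 && errno == EINTR);

    *child = pid;
    *wait_ret = r;

    if (r != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status); /* the status, not the return value */
}
```

Calling child_exit_code(7, &child, &ret) should give back 7, with ret equal to child. A function which hands out ret as "the status" would report the pid instead of the exit code.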

I did not stop there, as this part does not look like it is the problem. I found the following in mono/io-layer/processes.c:

static gboolean waitfor_pid (gpointer test, gpointer user_data)
{
...
    do {
        ret = waitpid (process->id, &status, WNOHANG);
    } while (errno == EINTR);

    if (ret <= 0) {
        /* Process not ready for wait */
#ifdef DEBUG
        g_message ("%s: Process %d not ready for waiting for: %s",
                   __func__, process->id, g_strerror (errno));
#endif

        return (FALSE);
    }

#ifdef DEBUG
    g_message ("%s: Process %d finished", __func__, ret);
#endif

    process->waited = TRUE;
...
}

And here we have the problem, I think. I changed the (ret <= 0) to (ret == 0 || (ret < 0 && errno != ECHILD)), so that a missing child (ECHILD) is no longer treated as "not ready yet". This will not really give the correct status, but at least it should not block anymore and I should be able to see the difference during the build.
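The idea behind this change can be sketched as a standalone C function. This is not the mono code and not a proposed patch; the names poll_child, wait_until_done, spawn_exiting_child and the enum are made up for this illustration. It keeps "still running" and "no such child" apart, and it looks at the return value before errno, because errno is only meaningful after a failed call.

```c
#define _POSIX_C_SOURCE 200809L  /* must come before the includes */

#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

enum child_state { CHILD_RUNNING, CHILD_EXITED, CHILD_GONE, CHILD_ERROR };

/* Non-blocking check of one child (hypothetical name). */
enum child_state poll_child(pid_t pid, int *status)
{
    pid_t ret;

    assert(status != NULL);

    do {
        ret = waitpid(pid, status, WNOHANG);
    } while (ret == -1 && errno == EINTR);

    if (ret == 0)
        return CHILD_RUNNING;   /* child exists, no state change yet */
    if (ret > 0)
        return CHILD_EXITED;    /* child terminated, *status is valid */
    if (errno == ECHILD)
        return CHILD_GONE;      /* no such child (anymore): stop polling */
    return CHILD_ERROR;
}

/* Poll with a short sleep until the child is no longer running. */
enum child_state wait_until_done(pid_t pid, int *status)
{
    const struct timespec pause = { 0, 1000000 };   /* 1 ms */
    enum child_state s;

    while ((s = poll_child(pid, status)) == CHILD_RUNNING)
        nanosleep(&pause, NULL);
    return s;
}

/* Demo helper: fork a child which exits at once with `code`. */
pid_t spawn_exiting_child(int code)
{
    pid_t pid = fork();

    if (pid == 0)
        _exit(code);
    return pid;                 /* -1 if fork() failed */
}
```

With a child from spawn_exiting_child(3), the first wait_until_done should return CHILD_EXITED with a status which decodes to 3. A second call for the same pid should return CHILD_GONE at once, instead of sleeping and polling forever.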

And now, after testing, I see a difference, but the problem is still there. The wait with ECHILD is gone from the loop, but there is still a loop with a semaphore operation:

62960 mono     CALL  clock_gettime(0xd,0xbf9feef8)
62960 mono     RET   clock_gettime 0
62960 mono     CALL  semop(0x20c0000,0xbf9feef6,0x1)
62960 mono     RET   semop 0
62960 mono     CALL  semop(0x20c0000,0xbf9feef6,0x1)
62960 mono     RET   semop 0
62960 mono     CALL  semop(0x20c0000,0xbf9feef6,0x1)
62960 mono     RET   semop 0
62960 mono     CALL  semop(0x20c0000,0xbf9feef6,0x1)
62960 mono     RET   semop 0
62960 mono     CALL  nanosleep(0xbf9fef84,0)
62960 mono     RET   nanosleep 0
62960 mono     CALL  clock_gettime(0xd,0xbf9feef8)
62960 mono     RET   clock_gettime 0
62960 mono     CALL  semop(0x20c0000,0xbf9feef6,0x1)
62960 mono     RET   semop 0
62960 mono     CALL  semop(0x20c0000,0xbf9feef6,0x1)
62960 mono     RET   semop 0
62960 mono     CALL  semop(0x20c0000,0xbf9feef6,0x1)
62960 mono     RET   semop 0
62960 mono     CALL  semop(0x20c0000,0xbf9feef6,0x1)
62960 mono     RET   semop 0
62960 mono     CALL  nanosleep(0xbf9fef84,0)

OK, there is more going on. I think someone with more knowledge about mono should have a look at this (do not only look at this semop thing, but also at why it loses a child).
