Debugging lang/mono – 2nd round

Today I again had some energy to look at why mono fails to build on FreeBSD-current.

I decided to do a debug build of mono. This did not work initially; I had to produce some patches first. :(

Does this mean nobody is doing debug builds of mono on FreeBSD?

I have to say, this experience with lang/mono is completely unsatisfying.

OK, bottom line: the debug build seems to prevent a race condition in most cases (I had far fewer lockups in each of the two builds I did).

Whatever it is, I do not care ATM (if the configure stuff looks at the architecture of the system, it may be the case that i386-portbld-freebsdX does not enable some important stuff which would be enabled when run with i486-portbld-freebsdX or better). Here are the patches I used, in case someone is interested. Warning: copy & paste converted tabs to spaces. You also have to apply the map.c part by hand when the build fails, else there is a parser error in mono (map.c is a generated file; maybe touching the right file would allow this patch to be applied in the normal patch stage):

--- mcs/class/Mono.Posix/Mono.Unix/UnixProcess.cs.orig       2010-01-29 11:34:00.592323482 +0100
+++ mcs/class/Mono.Posix/Mono.Unix/UnixProcess.cs    2010-01-29 11:34:18.540607357 +0100
@@ -57,7 +57,7 @@ namespace Mono.Unix {
                        int r = Native.Syscall.waitpid (pid, out status,
                                        Native.WaitOptions.WNOHANG | Native.WaitOptions.WUNTRACED);
                        UnixMarshal.ThrowExceptionForLastErrorIf (r);
-                       return r;
+                       return status;
                }
 
                public int ExitCode {
--- mono/io-layer/processes.c.orig    2010-01-29 11:36:08.904331535 +0100
+++ mono/io-layer/processes.c 2010-01-29 11:42:21.819159544 +0100
@@ -160,7 +160,7 @@ static gboolean waitfor_pid (gpointer te
                ret = waitpid (process->id, &status, WNOHANG);
        } while (errno == EINTR);
 
-       if (ret <= 0) {
+       if (ret == 0 || (ret < 0 && errno != ECHILD)) {
                /* Process not ready for wait */
 #ifdef DEBUG
                g_message ("%s: Process %d not ready for waiting for: %s",
@@ -169,6 +169,17 @@ static gboolean waitfor_pid (gpointer te
 
                return (FALSE);
        }
+
+       if (ret < 0 && errno == ECHILD) {
+#ifdef DEBUG
+               g_message ("%s: Process %d does not exist (anymore)", __func__,
+                          process->id);
+#endif
+               /* Faking the return status. I do not know if it is correct
+                * to assume a successful exit.
+                */
+               status = 0;
+       }
 
 #ifdef DEBUG
        g_message ("%s: Process %d finished", __func__, ret);
--- mono/metadata/mempool.c.orig      2010-01-29 11:58:16.871052861 +0100
+++ mono/metadata/mempool.c   2010-01-29 12:30:45.143367454 +0100
@@ -212,12 +212,14 @@ mono_backtrace (int size)
 
         EnterCriticalSection (&mempool_tracing_lock);
         g_print ("Allocating %d bytes\n", size);
+#if defined(HAVE_BACKTRACE_SYMBOLS)
         symbols = backtrace (array, BACKTRACE_DEPTH);
         names = backtrace_symbols (array, symbols);
         for (i = 1; i < symbols; ++i) {
                 g_print ("\t%s\n", names [i]);
         }
         free (names);
+#endif
         LeaveCriticalSection (&mempool_tracing_lock);
 }
--- mono/metadata/metadata.c.orig     2010-01-29 11:59:38.552316989 +0100
+++ mono/metadata/metadata.c  2010-01-29 12:00:43.957337476 +0100
@@ -3673,12 +3673,16 @@ mono_backtrace (int limit)
         void *array[limit];
         char **names;
         int i;
+#if defined(HAVE_BACKTRACE_SYMBOLS)
         backtrace (array, limit);
         names = backtrace_symbols (array, limit);
         for (i =0; i < limit; ++i) {
                 g_print ("\t%s\n", names [i]);
         }
         g_free (names);
+#else
+       g_print ("No backtrace available.\n");
+#endif
 }
 #endif
--- support/map.c.orig        2010-01-29 12:05:22.374653708 +0100
+++ support/map.c 2010-01-29 12:10:29.024412452 +0100
@@ -216,7 +216,7 @@
 #define _cnm_dump(to_t, from) do {} while (0)
 #endif /* def _CNM_DUMP */
 
-#ifdef DEBUG
+#if defined(DEBUG) && !defined(__FreeBSD__)
 #define _cnm_return_val_if_overflow(to_t,from,val)  G_STMT_START {   \
         int     uns = _cnm_integral_type_is_unsigned (to_t);             \
         gint64  min = (gint64)  _cnm_integral_type_min (to_t);           \

Mono build problems on FreeBSD-current

I am trying to build mono on FreeBSD-current (it is a dependency of some GNOME program). Unfortunately, this does not work correctly.

What I see are hangs of the build. If I stop the build when it hangs and restart it, it continues and gets a little bit further through the build steps, but then it hangs again.

If I ktrace the hanging process, I see a call to wait returning with the error that the child does not exist. Then there is a call to nanosleep.

It looks to me like this process missed some SIGCHLD (or is waiting for something which did not exist at all), and a loop is waiting for a child to exit. This loop probably has no proper exit condition for the case that there is no such child (anymore). As such, it will stay in this loop forever.

So I grepped around a little bit in mono and found the following code in <mono-src-dir>/mcs/class/Mono.Posix/Mono.Unix/UnixProcess.cs:

public void WaitForExit ()
{
    int status;
    int r;
    do {
        r = Native.Syscall.waitpid (pid, out status, (Native.WaitOptions) 0);
    } while (UnixMarshal.ShouldRetrySyscall (r));
    UnixMarshal.ThrowExceptionForLastErrorIf (r);
}

This does look a little bit as if it could be related to the problem I see, but ShouldRetrySyscall only returns true if errno is EINTR. So this looks correct. :-(

I looked a little bit more at this file, and it looks like either I do not understand the semantics of this language, or GetProcessStatus returns the return value of the waitpid call instead of the status (which, to my understanding, is not what it should return). If I am correct, it cannot really detect the status of a process. It would be very bad if such a fundamental thing went unnoticed in mono… which does not put the unit tests (if any) or the general testing of mono in a good light. For this reason I hope I am wrong.
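To illustrate the distinction in plain C: waitpid returns the PID of the child it reaped (or 0 / -1), while the exit information arrives through the status out-parameter and has to be decoded with the W* macros. The following is only a sketch; the helper names spawn_exiting_child and wait_exit_code are made up for this example and are not mono code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits at once with `code`. */
static pid_t spawn_exiting_child(int code)
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(code);    /* child: terminate immediately */
    return pid;         /* parent: PID of the child, or -1 on error */
}

/*
 * Hypothetical helper: block until `pid` exits.
 * Returns 0 and stores the exit code in *code on a normal exit,
 * -1 on any error or if the child did not exit normally.
 */
static int wait_exit_code(pid_t pid, int *code)
{
    int status = 0;
    pid_t r;

    assert(code != NULL);
    do {
        r = waitpid(pid, &status, 0);
    } while (r == -1 && errno == EINTR);

    if (r != pid)               /* r is a PID (or -1), not a status */
        return -1;
    if (!WIFEXITED(status))     /* killed by a signal, stopped, ... */
        return -1;
    *code = WEXITSTATUS(status);
    return 0;
}
```

Returning r where the status is expected would hand the caller a PID, which is why the patch above returns status instead.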

I did not stop there, as this part does not look like it is the problem. I found the following in mono/io-layer/processes.c:

static gboolean waitfor_pid (gpointer test, gpointer user_data)
{
…
    do {
        ret = waitpid (process->id, &status, WNOHANG);
    } while (errno == EINTR);

    if (ret <= 0) {
        /* Process not ready for wait */
#ifdef DEBUG
        g_message ("%s: Process %d not ready for waiting for: %s",
                   __func__, process->id, g_strerror (errno));
#endif

        return (FALSE);
    }

#ifdef DEBUG
    g_message ("%s: Process %d finished", __func__, ret);
#endif

    process->waited = TRUE;
…
}

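And here we have the problem, I think. With (ret <= 0), a waitpid that fails with ECHILD is treated like "not ready yet", so the caller keeps polling for a child that will never show up. I changed the (ret <= 0) to (ret == 0 || (ret < 0 && errno != ECHILD)). This will not really give the correct status, but at least it should not block anymore and I should be able to see the difference during the build.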
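A minimal sketch of how the three outcomes of a non-blocking waitpid could be kept apart in plain C. The enum and the function poll_child are invented for this illustration; this is not how mono structures the code. The sketch also retries only when waitpid actually failed with EINTR: errno is not reset by a successful call, so testing errno alone (as the loop above does) can act on a stale value.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical result type, not part of mono. */
enum poll_result {
    POLL_RUNNING,   /* child exists but has not changed state yet */
    POLL_REAPED,    /* child was reaped, *status is valid */
    POLL_NO_CHILD,  /* ECHILD: there is nothing to wait for (anymore) */
    POLL_ERROR      /* any other waitpid failure */
};

/* Poll `pid` once without blocking and classify the result. */
static enum poll_result poll_child(pid_t pid, int *status)
{
    pid_t ret;

    assert(status != NULL);
    do {
        ret = waitpid(pid, status, WNOHANG);
    } while (ret == -1 && errno == EINTR);  /* errno only counts on failure */

    if (ret == 0)
        return POLL_RUNNING;
    if (ret > 0)
        return POLL_REAPED;
    return errno == ECHILD ? POLL_NO_CHILD : POLL_ERROR;
}
```

A caller that loops only while the result is POLL_RUNNING terminates in the ECHILD case instead of sleeping and retrying forever. What exit status to report for POLL_NO_CHILD remains an open question, as in the patch.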

And now, after testing, I see a difference, but the problem is still there. The wait with ECHILD is gone from the loop, but there is still some loop with a semaphore operation:

62960 mono     CALL  clock_gettime(0xd,0xbf9feef8)
62960 mono     RET   clock_gettime 0
62960 mono     CALL  semop(0x20c0000,0xbf9feef6,0x1)
62960 mono     RET   semop 0
62960 mono     CALL  semop(0x20c0000,0xbf9feef6,0x1)
62960 mono     RET   semop 0
62960 mono     CALL  semop(0x20c0000,0xbf9feef6,0x1)
62960 mono     RET   semop 0
62960 mono     CALL  semop(0x20c0000,0xbf9feef6,0x1)
62960 mono     RET   semop 0
62960 mono     CALL  nanosleep(0xbf9fef84,0)
62960 mono     RET   nanosleep 0
62960 mono     CALL  clock_gettime(0xd,0xbf9feef8)
62960 mono     RET   clock_gettime 0
62960 mono     CALL  semop(0x20c0000,0xbf9feef6,0x1)
62960 mono     RET   semop 0
62960 mono     CALL  semop(0x20c0000,0xbf9feef6,0x1)
62960 mono     RET   semop 0
62960 mono     CALL  semop(0x20c0000,0xbf9feef6,0x1)
62960 mono     RET   semop 0
62960 mono     CALL  semop(0x20c0000,0xbf9feef6,0x1)
62960 mono     RET   semop 0
62960 mono     CALL  nanosleep(0xbf9fef84,0)

OK, there is more going on. I think someone with more knowledge about mono should have a look at this (do not only look at this semop thing, but also at why it loses a child).

Firefox 3.6, finally delivering sane proxy handling

At work we have to use a proxy which requires authorization. With previous versions (Firefox 3.0.x and 3.5.y for each valid x and y) I had the problem that, when starting Firefox, each tab asked for the master password in order to fill in the proxy-auth data (shortcut: enter it only for the first request, and for all others just hit return/OK). So for each tab I had to do something for the master password, and after that I also had to confirm the proxy-auth stuff for each tab.

Very annoying! Oh, I should maybe mention that as of this writing I have 31 tabs open. Sometimes there are more, sometimes there are fewer.

Now with Firefox 3.6 this is not the case anymore. Yeah! Great! Finally the master password is requested only once, then the proxy-auth data once, and then all tabs proceed.

It took a long time since my first report about this, but now it is finally there. This is the best improvement in 3.6 for me.

Stability problems solved (hardware problem)

After putting the disks of the 7-stable system which exhibited stability problems into a completely different system (it is a rented root server, not our own hardware), the system has now survived more than a day (and still no trace of problems) with the UFS setup. Previously it would crash after some minutes.

The ZFS setup with the changed hardware had a problem during the night before (like always after all my ZFS-related changes on this machine), but on this machine I had changed all locks in ZFS from shared locks to exclusive locks (this extended the uptime from 4-6 hours to "until I rebooted the morning after because of hanging processes"), so the problem may be caused by that change. I do not know yet if we will test the ZFS setup with the pure 7-stable source we use now or not (the goal was to get back a stable system, instead of playing around with unrelated stuff).

It looks like some kind of hardware problem was uncovered by updating from 7.1 to 7.2 (and subsequently 7-stable). This new machine has a completely different chipset, a new CPU and RAM and PSU and … so I do not really know what caused this (but the fact that the previous system did not recognize the CPU after replacing it with a bigger one, and the observation that only shared locks with a specific usage pattern were affected, makes me suspect missing microcode updates…).