Stability problems solved (hardware problem)

After moving the disks of the 7-stable system that exhibited stability problems into a completely different system (it is a rented root server, not our own hardware), the system has now survived more than a day with the UFS setup, and there is still no trace of problems. Previously it would crash after a few minutes.

The ZFS setup on the changed hardware had a problem during the previous night (as always after my ZFS-related changes on this machine). However, on this machine I had changed all locks in ZFS from shared locks to exclusive locks (this extended the uptime from 4 to 6 hours to “until I rebooted the morning after because of hanging processes”), so that change may be the cause. I do not know yet whether we will test the ZFS setup with the pure 7-stable source we use now (the goal was to get back a stable system, not to play around with unrelated stuff).

It looks like some kind of hardware problem was uncovered by updating from 7.1 to 7.2 (and subsequently to 7-stable). The new machine has a completely different chipset, a new CPU, RAM, PSU and so on, so I do not really know what caused this. However, the previous system did not recognize the CPU after it was replaced with a bigger one, and only shared locks with a specific usage pattern were affected, which makes me suspect missing microcode updates.

I merged a lot of ZFS patches to 7-stable

During the last weeks I identified 64 patches for ZFS which are in 8-stable but not in 7-stable. I had a deeper look at 56 of them, and most of those are now committed to 7-stable. The ones among those 56 which I did not commit are not applicable to 7-stable (infrastructure differences between 8 and 7).

Unfortunately this did not solve the stability problems I have on a 7-stable system.

I also committed a diff-reduction patch (between 8-stable and 7-stable) which also fixed some not-so-harmless mismerges (a memory leak and the same mutex being initialized twice in different places). I have no idea yet whether it helps in my case.

I also want to merge the new ARC reclaim logic from head to 8-stable and 7-stable. Maybe I can do this tomorrow.

Currently I am running a test with a kernel in which the shared locks for ZFS are switched to exclusive locks.

Stabilizing 7-stable…

The 7-stable system on which I have stability problems after an update from 7.1 to 7.2/7-stable is now semi-stable.

The watchdog reboots the machine after one minute of no reaction (currently it is able to run for 3 to 4 hours), and the jails now come up without problems.

The problem with the jails was that, for example, the mysql-server startup went into the STOP state because TTY input was “requested”. I solved the problem by using /dev/null as input on jail startup. On -current I do not see this behavior (I have a 9-current system with a lot of jails which reboots every X days, and there mysql does not go into the STOP state).

I also start the jails in the background, so that one blocking jail does not block everything (as is done in -current).

To say this with code:

--- /usr/src/etc/rc.d/jail      2009-02-07 15:04:35.000000000 +0100
+++ /etc/rc.d/jail      2009-12-16 17:03:12.000000000 +0100
@@ -556,7 +556,8 @@
 fi
 _tmp_jail=${_tmp_dir}/jail.$$
 eval ${_setfib} jail ${_flags} -i ${_rootdir} ${_hostname} \
-                       \"${_addrl}\" ${_exec_start} > ${_tmp_jail} 2>&1
+                       \"${_addrl}\" ${_exec_start} > ${_tmp_jail} 2>&1 \
+                       </dev/null

 if [ "$?" -eq 0 ] ; then
 _jail_id=$(head -1 ${_tmp_jail})
@@ -623,4 +624,4 @@
 if [ -n "$*" ]; then
 jail_list="$*"
 fi
-run_rc_command "${cmd}"
+run_rc_command "${cmd}" &
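
The effect of the /dev/null redirection in the diff can be seen in isolation with a small sketch. The function ask_for_input is a hypothetical stand-in for a start command that tries to read from its terminal; it is not part of rc.d/jail. With stdin attached to /dev/null the read hits end-of-file immediately, so nothing is left waiting for TTY input:

```shell
#!/bin/sh
# Hypothetical stand-in for a jail start command that asks for terminal input.
ask_for_input() {
    if read -r answer; then
        echo "got input: ${answer}"
    else
        echo "no input available (EOF)"
    fi
}

# With stdin on /dev/null, read sees EOF at once instead of waiting on a TTY.
result=$(ask_for_input </dev/null)
echo "${result}"
```

This only illustrates the redirection itself; whether a given daemon copes gracefully with EOF on stdin is an assumption that has to be checked per service.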

I also identified 57 patches for ZFS which are in 8-stable but not in 7-stable (I do not think they could solve the deadlock, but I do not really know, and now that there is one FS on ZFS, I would like to get as much fixed as possible). Some of them should be merged, some would be nice to merge, and some I do not care much about (but if they are easy to merge, why not…). I already have all revisions and the corresponding commit logs available in an email draft.

Now I just need to write a little bit of text and find some people willing to help (some of the changes need a review to determine whether they are applicable to 7-stable, and everything should be tested on a scratch box).

Stability problems with 7-stable

On the machine where I host this blog, I have/had some stability problems.

Last week I updated the machine from FreeBSD 7.1-pX to 7.2-p5 (GENERIC kernel in both cases). 5 to 10 minutes after the reboot into the new version the machine deadlocked. After some roadblocks (ordering a KVM switch from the hoster, the KVM switch not working through a proxy (during lunchtime at work), a broken video capture on the KVM switch, and a replacement on Monday morning to avoid paying the weekend fees), I spent a big part of the night trying to get it stable. I tried disabling SMP, enabling INVARIANTS and WITNESS, changing the scheduler, cutting the software mirror (to rule out a mismatch between the contents of the disks after all the hard reboots) and updating to 7-stable.

Unfortunately nothing helped. 🙁

Googling around a little (it is an AMD dual-core system with an NVidia MCP61 chipset) led me to a post on the mailing lists from 2008 which talks about an issue with the buffer cache. I do not know if this is still an issue (I have sent an email to kib@ to ask about it), and my scenario is not the same as the one described in the mail, but because of this I decided to switch one of the two UFS mirrors to ZFS.

The first boot into ZFS again caused a reboot after a few minutes (I do not know whether it was because of a memory-exhaustion panic or because of a deadlock), but as I had not tuned the kernel for ZFS I am tempted to believe that I should not count that. Now, after tuning the kernel (increasing the kmem_size to 700M, no prefetching, limiting the ARC to 40M), it has been up for nearly 2 hours (as of this writing… crossing fingers). Before, it was not able to survive more than a few minutes with just the jail for the mails up. Now I not only have the mail jail up, but also the jail for the blog (one jail is still disabled, but I will take care of that after this post).
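
For readers who want to try similar tuning, the three settings can be sketched as loader tunables. This is only a sketch under assumptions: the tunable names are the usual FreeBSD ones, but setting vm.kmem_size_max alongside vm.kmem_size is my assumption and not something stated above, and LOADER_CONF is a hypothetical variable that writes to a scratch file instead of the real /boot/loader.conf:

```shell
#!/bin/sh
# Sketch: write the ZFS tuning described above into a loader.conf-style fragment.
# LOADER_CONF is hypothetical; on a real system the settings live in /boot/loader.conf.
LOADER_CONF="${LOADER_CONF:-./loader.conf.zfs-sketch}"

cat > "${LOADER_CONF}" <<'EOF'
# Kernel memory map size: 700M
vm.kmem_size="700M"
# Assumption: raise the upper bound to the same value
vm.kmem_size_max="700M"
# No ZFS prefetching
vfs.zfs.prefetch_disable="1"
# Limit the ARC to 40M
vfs.zfs.arc_max="40M"
EOF

echo "wrote ${LOADER_CONF}"
```

Loader tunables only take effect at the next boot, so the fragment has to be merged into /boot/loader.conf and the machine rebooted before the values apply.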

I do not know if only increasing the kmem_size would have helped with the problem, but as I was testing a GENERIC kernel + gmirror module in the beginning, I expected that the auto-tuning of this value should have been enough for such a simple setup (2GB RAM, 2 disks with 3 partitions each: one partition pair for root, one for swap, one for the jails).

I hope that I have stabilized the system now. I may test some patches if someone comes up with something, so do not be surprised if the blog and email to me are a little bit flaky.