Some fix­es for ZFS on 7‑stable (more testers wanted)

Due to the prob­lems with a 7‑stable machine, I had a look at some unmerged fix­es for ZFS (58 changes not merged).

I backported some of those changes from 8-stable to 7-stable, and I have this running on one 7-stable machine. I would like to get some more feedback on it (even an “it works for me” would be great). The main part of this change is that the FreeBSD taskqueue is now used instead of the OpenSolaris one (plus some other changes which may improve the ZFS experience).

It would also be nice if someone could have a look at the FIRST_THREAD_IN_PROC part. Can there be more than one thread at this point (I do not think so), and should I use FOREACH_THREAD_IN_PROC instead?

How to apply:

  • cd /usr/src/
  • fetch http://www.Leidinger.net/FreeBSD/test/releng7_zfs_merge3.diff
  • fetch http://www.Leidinger.net/FreeBSD/test/opensolaris_taskq.c
  • fetch http://www.Leidinger.net/FreeBSD/test/taskq.h
  • mv taskq.h sys/cddl/contrib/opensolaris/uts/common/sys/taskq.h
  • mv opensolaris_taskq.c sys/cddl/compat/opensolaris/kern/opensolaris_taskq.c
  • patch -p0 --quiet <releng7_zfs_merge3.diff
  • ignore the 2 .rej files
  • rm -f sys/cddl/compat/opensolaris/sys/taskq_impl.h*
  • rm -f sys/cddl/compat/opensolaris/sys/taskq.h*
  • rm -f sys/cddl/contrib/opensolaris/uts/common/os/taskq.c*
  • rebuild kernel
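
The steps above can also be collected into a small script. This is only a sketch: the names DRY_RUN, run and apply_zfs_merge are made up for this example, and it assumes that the source tree lives in /usr/src unless another directory is given. With DRY_RUN=1 it only prints what it would do.

```sh
#!/bin/sh
# Sketch of the manual steps listed above. DRY_RUN, run() and
# apply_zfs_merge() are names invented for this example.

BASE_URL="http://www.Leidinger.net/FreeBSD/test"
DIFF="releng7_zfs_merge3.diff"

# Print the command if DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

apply_zfs_merge() {
    src_dir="${1:-/usr/src}"
    run cd "${src_dir}" || return 1

    for f in "${DIFF}" opensolaris_taskq.c taskq.h; do
        run fetch "${BASE_URL}/${f}" || return 1
    done

    run mv taskq.h sys/cddl/contrib/opensolaris/uts/common/sys/taskq.h || return 1
    run mv opensolaris_taskq.c sys/cddl/compat/opensolaris/kern/opensolaris_taskq.c || return 1

    # Two .rej files are expected, so a non-zero exit status of patch
    # is not treated as fatal here.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "patch -p0 --quiet <${DIFF}"
    else
        patch -p0 --quiet <"${DIFF}" || echo "patch reported rejects (expected)" >&2
    fi

    run rm -f sys/cddl/compat/opensolaris/sys/taskq_impl.h*
    run rm -f sys/cddl/compat/opensolaris/sys/taskq.h*
    run rm -f sys/cddl/contrib/opensolaris/uts/common/os/taskq.c*
    echo "now rebuild the kernel"
}
```

To try it, source the file, set DRY_RUN=1 and call apply_zfs_merge to see the command list, then run it again without DRY_RUN once the list looks right.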

I do not list here all 16 of the 58 outstanding patches which are covered by this diff; a detailed list can be found on the stable and fs mailing lists.

Sta­bi­liz­ing 7‑stable…

The 7‑stable sys­tem on which I have sta­bil­i­ty prob­lems after an update from 7.1 to 7.2/7‑stable is now semi-stable.

The watchdog reboots the machine after one minute without a reaction (currently it is able to run 3 to 4 hours), and the jails come up without problems now.

The problem with the jails was that, for example, the mysql-server startup went into the STOP state because TTY input was “requested”. I solved the problem by using /dev/null as input on jail startup. On -current I do not see this behavior (I have a 9-current system with a lot of jails which reboots every X days, and there mysql does not go into the STOP state).

I also start the jails in the background, so that one blocking jail does not block everything (done like in -current).

To say this with code:

--- /usr/src/etc/rc.d/jail      2009-02-07 15:04:35.000000000 +0100
+++ /etc/rc.d/jail      2009-12-16 17:03:12.000000000 +0100
@@ -556,7 +556,8 @@
 fi
 _tmp_jail=${_tmp_dir}/jail.$$
 eval ${_setfib} jail ${_flags} -i ${_rootdir} ${_hostname} \
-                       \"${_addrl}\" ${_exec_start} > ${_tmp_jail} 2>&1
+                       \"${_addrl}\" ${_exec_start} > ${_tmp_jail} 2>&1 \
+                       </dev/null

 if [ "$?" -eq 0 ] ; then
 _jail_id=$(head -1 ${_tmp_jail})
@@ -623,4 +624,4 @@
 if [ -n "$*" ]; then
 jail_list="$*"
 fi
-run_rc_command "${cmd}"
+run_rc_command "${cmd}" &
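
The two ideas in this patch can be tried outside of rc.d with a few lines of sh. This is a sketch only: run_detached is a name made up for this example, and the "mail" and "blog" services are placeholders which just sleep for a second. A command started with its input taken from /dev/null sees an immediate end-of-file instead of waiting for the terminal, and commands started with & do not hold each other up.

```sh
#!/bin/sh
# Run a command with stdin redirected from /dev/null, so that a read
# gets an immediate EOF instead of waiting for terminal input.
# run_detached is a helper name invented for this sketch.
run_detached() {
    "$@" </dev/null
}

# 'cat' without arguments reads stdin; here it returns at once.
run_detached cat
echo "cat exited with status $?"

# Start several placeholder services in the background so that a slow
# one does not hold up the others, then wait for all of them.
for name in mail blog; do
    run_detached sh -c "sleep 1; echo started ${name}" &
done
wait
```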

I also iden­ti­fied 57 patch­es for ZFS which are in 8‑stable, but not in 7‑stable (I do not think they could solve the dead­lock, but I do not real­ly know, and now that there is one FS on ZFS, I would like to get as much fixed as pos­si­ble). Some of them should be merged, some would be nice to merge, and some I do not care much about (but if they are easy to merge, why not…). I already have all revi­sions and the cor­re­spond­ing com­mit logs avail­able in an email-draft.

Now I just need to write a lit­tle bit of text and find some peo­ple will­ing to help (some of the changes need a review if they are applic­a­ble to 7‑stable, and every­thing should be test­ed on a scratch-box).

Sta­bil­i­ty prob­lems with 7‑stable

On the machine where I host this blog, I have/had some sta­bil­i­ty problems.

Last week I updated the machine from FreeBSD 7.1-pX to 7.2-p5 (GENERIC kernel in both cases). 5 to 10 minutes after the reboot into the new version the machine deadlocked. After some roadblocks (ordering a KVM switch from the hoster, the KVM switch not working through a proxy (during lunchtime at work), a broken video capture of the KVM switch, and a replacement on Monday morning to avoid paying the weekend fees), I spent a big part of the night trying to get it stable. I tried disabling SMP, enabling INVARIANTS and WITNESS, changing the scheduler, breaking the software mirror (to rule out a mismatch between the contents of the disks after all the hard reboots) and updating to 7-stable.

Unfor­tu­nate­ly noth­ing helped. 🙁

Googling around a little (it is an AMD dual-core system with an NVidia MCP61 chipset) led me to a post on the mailing lists from 2008 which talks about an issue with the buffer cache. I do not know if this is still an issue (I have sent an email to kib@ to ask about it), and my scenario is not the same as the one described in that mail, but because of this I decided to switch one of the two UFS mirrors to ZFS.

The first boot into ZFS again caused a reboot after some minutes (I do not know if it was because of a memory-exhaustion panic or because of a deadlock), but as I had not tuned the kernel for ZFS I am tempted to believe that I should not count that. Now, after tuning the kernel (increasing the kmem_size to 700M, no prefetching, limiting the ARC to 40M), it has been up for nearly 2 hours (as of this writing... crossing fingers). Before, it was not able to survive more than a few minutes with just the jail for the mails up. Now I not only have the mail jail up, but also the jail for the blog (one jail is still disabled, but I will take care of that after this post).
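
For reference, the tuning described above would typically be expressed as loader tunables. The small function below just prints such lines; zfs_loader_tunables is a name made up for this sketch, and the mapping of "kmem_size", "no prefetching" and "ARC limit" to exactly these three tunables is my assumption about how the values are set, not a copy of the real configuration.

```sh
#!/bin/sh
# Print loader.conf lines matching the tuning described in the text:
# kmem_size 700M, prefetch disabled, ARC limited to 40M.
# zfs_loader_tunables is a helper name invented for this sketch.
zfs_loader_tunables() {
    cat <<'EOF'
vm.kmem_size="700M"
vfs.zfs.prefetch_disable="1"
vfs.zfs.arc_max="40M"
EOF
}

zfs_loader_tunables
```

The output would go into /boot/loader.conf after a review, and the values take effect on the next boot. Some setups also raise vm.kmem_size_max alongside vm.kmem_size; whether that is needed here is not something the text above answers.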

I do not know if only increas­ing the kmem_size would have helped with the prob­lem, but as I was test­ing a GENERIC ker­nel + gmir­ror mod­ule in the begin­ning, I expect­ed that the auto-tuning of this val­ue should have been enough for such a sim­ple set­up (2GB RAM, 2 disks with 3 par­ti­tions each, one par­ti­tion pair for root, one for swap, one for the jails).

I hope that I have stabilized the system now. It may be the case that I will test some patches if someone comes up with something, so do not be surprised if the blog and email to me are a little bit flaky.

Progress with Net­work­er bugs

Our bug with savepnpc, which causes the post-command to start one minute after the pre-command even if the backup is not done yet, is now hopefully close to being resolved. We opened a problem report for this in July; this week we were told that a patch for it is available. The bad part is that it has been available for 3 weeks and nobody told us. The good part is that we now have it installed on a machine to see if it helps (all zones there seem to be OK, but we have zones where it sometimes works and sometimes fails, so we are not 100% sure, but we hope for the best). We were told that it will be included in Networker 7.5.1.8.

Our other issues are at least no longer in a helpdesk loop; they seem to have reached the developers now.

FreeNAS & Sen­sors for FreeBSD

This weekend I was told that FreeNAS seems to want to move from FreeBSD to Linux (since then it seems there could be a Linux and a FreeBSD version). One of the reasons seems to be a missing sensors framework.

As I committed a port of the OpenBSD sensors framework (produced as part of the Google Summer of Code 2007) to FreeBSD and had to remove it afterwards because one committer complained very loudly, I was asked what the status of this is.

The short status is: Nobody is doing anything about it.

Before I explain the long status, I give a short overview of what this sensors framework is:

  • a kernel API which allows adding sensors
  • an interface for the userland to query the sensor data
  • some basic userland code to show and log the sensor info

The API and the query interface are more or less independent. The userland code was more a logging infrastructure than a real monitoring solution. The reason was that real monitoring solutions already exist (Nagios, snmpd, …) and can be adapted to query the sensors. Ideally a query in userland should be handled by a library instead of directly accessing the sysctl interface; this way the kernel<->userland interface would be abstracted away (and could be replaced as needs arise). This was not done; it was something to be done later (Rome was not built in a day).
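
To illustrate the idea of hiding the sysctl interface behind a small layer, here is a sketch in sh. Everything in it is hypothetical: sensor_get, SENSOR_ROOT and SYSCTL are names made up for this example, the hw.sensors prefix is an assumption based on the OpenBSD naming, and cpu0.temp0 is just a placeholder sensor name. The point is only that callers ask for a sensor by name and do not need to know how the value reaches userland.

```sh
#!/bin/sh
# Hypothetical thin wrapper around the sysctl-based query interface.
# SENSOR_ROOT (assumed MIB prefix), SYSCTL and sensor_get are names
# invented for this sketch; they are not part of the framework.
SENSOR_ROOT="${SENSOR_ROOT:-hw.sensors}"
SYSCTL="${SYSCTL:-sysctl}"

# Print the value of one sensor, e.g. "sensor_get cpu0.temp0".
# If the kernel<->userland interface changes, only this function
# needs to be adapted, not the callers.
sensor_get() {
    ${SYSCTL} -n "${SENSOR_ROOT}.$1"
}
```

A monitoring script would then call sensor_get with a sensor name instead of calling sysctl directly.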

The userland interface also only cared about dumb sensors (those which you need to query manually to get the information). Smart sensors (those which are able to send events themselves) were not taken care of in the sense of really sending sensor-triggered events, but the kernel API allowed adding such sensors. The sysctl interface has no way of sending events, but FreeBSD already has an event interface (devd takes care of it). It would not have been a problem to send events via this channel and let a userland library take care of the delivery together with other sensor data in userland.

And now the long sta­tus is:

PHK complained loudly about it. First he said he had not looked at it, but complained that it is not good regardless. After a lot of nagging from me he had a look at it and was not happy about the time stuff in it (short: the FreeBSD timecounter code is better). This was not a problem in my opinion; we could have disabled this part without problems. After I offered this, he complained that the sensors framework uses the sysctl interface instead of an entry in /dev.

At this point in time several userland utilities already used the sysctl framework to query for status data in the kernel. So there was already precedent for such a use of it. Later some more such uses were added too (e.g. the procstat stuff by core team member Robert Watson).

I saved some of the cor­re­spond­ing mails (to pub­lic mail­ing lists) in a mbox file, read the mess your­self if you want.

The bottom line is: Several committers (even some which we could call high-profile committers) told me that they do not see a problem in the use of the sysctl interface. They do not seem to want to say so in public (none of them voiced their opinion in the thread, so do not ask me who those people are). I am not interested in investing more of my spare time into fighting windmills (this is what it looks like to me).

So, if someone is interested in the code, r172631 has it. In the perforce repository you can maybe find some sensors. I think most of it can still be used without many changes.

If someone tries it with a more recent FreeBSD, please drop me a note if it just applies fine, or a patch (or a URL to it) if it needs some modifications. Who knows, maybe in a future project it may be useful for me.

If there is enough inter­est by sev­er­al peo­ple, I can even put up a wiki page where those peo­ple can coor­di­nate, but that is most prob­a­bly all I am will­ing to invest fur­ther into this (at least in my unpaid time).