Some fixes for ZFS on 7-stable (more testers wanted)

Because of the problems with one of my 7-stable machines, I had a look at the ZFS fixes which are not merged to 7-stable yet (58 changes).

I backported some of those changes from 8-stable to 7-stable, and I have this running on one 7-stable machine. I would like to get some more feedback on it (even an "it works for me" would be great). The main part of this change is that the FreeBSD taskqueue is now used instead of the OpenSolaris one (plus some other changes which may improve the ZFS experience).

It would also be nice if someone could have a look at the FIRST_THREAD_IN_PROC part. Can there be more than one thread at this place (I do not think so), or should I use FOREACH_THREAD_IN_PROC instead?

How to apply:

  • cd /usr/src/
  • fetch http://www.Leidinger.net/FreeBSD/test/releng7_zfs_merge3.diff
  • fetch http://www.Leidinger.net/FreeBSD/test/opensolaris_taskq.c
  • fetch http://www.Leidinger.net/FreeBSD/test/taskq.h
  • mv taskq.h sys/cddl/contrib/opensolaris/uts/common/sys/taskq.h
  • mv opensolaris_taskq.c sys/cddl/compat/opensolaris/kern/opensolaris_taskq.c
  • patch -p 0 --quiet <releng7_zfs_merge3.diff
  • ignore the 2 .rej files
  • rm -f sys/cddl/compat/opensolaris/sys/taskq_impl.h*
  • rm -f sys/cddl/compat/opensolaris/sys/taskq.h*
  • rm -f sys/cddl/contrib/opensolaris/uts/common/os/taskq.c*
  • rebuild kernel
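
Since the steps above say that exactly 2 .rej files are expected, a tester may want to confirm that number before ignoring them. The following is only a minimal sketch: the function names count_rejects and list_rejects are made up for this illustration, and the directory to scan is an assumption (the post does not say where the rejects end up).

```shell
#!/bin/sh
# count_rejects [DIR]: print the number of leftover *.rej files below DIR
# (default: the current directory).
count_rejects() {
    find "${1:-.}" -type f -name '*.rej' | wc -l | tr -d ' '
}

# list_rejects [DIR]: print the reject files themselves, sorted, so that
# they can be inspected before they are ignored.
list_rejects() {
    find "${1:-.}" -type f -name '*.rej' | sort
}

# Example, run from /usr/src after the patch step:
#   count_rejects .
#   list_rejects .
```

If the count differs from 2, the patch probably did not apply the way it did on my machine, and it would be better to look at the rejects before rebuilding the kernel.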

I do not list all 16 of the 58 outstanding patches which are covered here; a detailed list can be found on the stable and fs mailing lists.

Stabilizing 7-stable…

The 7-stable system on which I have had stability problems since the update from 7.1 to 7.2/7-stable is now semi-stable.

The watchdog reboots the machine after one minute without a reaction (currently it manages to run 3-4 hours between reboots), and the jails now come up without problems.

The problem with the jails was that, for example, the mysql-server startup went into the STOP state because TTY input was "requested". I solved the problem by using /dev/null as input at jail startup. On -current I do not see this behavior (I have a 9-current system with a lot of jails which reboots every X days, and there mysql does not go into the STOP state).

I also start the jails in the background, so that one blocking jail does not block everything (as it is done in -current).

To say this with code:

--- /usr/src/etc/rc.d/jail      2009-02-07 15:04:35.000000000 +0100
+++ /etc/rc.d/jail      2009-12-16 17:03:12.000000000 +0100
@@ -556,7 +556,8 @@
 eval ${_setfib} jail ${_flags} -i ${_rootdir} ${_hostname} \
-                       \"${_addrl}\" ${_exec_start} > ${_tmp_jail} 2>&1
+                       \"${_addrl}\" ${_exec_start} > ${_tmp_jail} 2>&1 \
+                       </dev/null

 if [ "$?" -eq 0 ] ; then
 _jail_id=$(head -1 ${_tmp_jail})
@@ -623,4 +624,4 @@
 if [ -n "$*" ]; then
-run_rc_command "${cmd}"
+run_rc_command "${cmd}" &
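
The two ideas in this diff (input from /dev/null and starting in the background) can be tried outside of rc.d/jail. This is a minimal sketch and not part of the patch; the function name run_detached is made up for the illustration. A background process which reads from the terminal is typically stopped by the terminal driver, while one which reads from /dev/null just sees end-of-file.

```shell
#!/bin/sh
# run_detached LOGFILE COMMAND [ARGS...]
# Start COMMAND in the background with stdin taken from /dev/null, so it
# gets end-of-file right away instead of waiting for (or being stopped
# on) terminal input. stdout and stderr are collected in LOGFILE.
run_detached() {
    _log=$1
    shift
    "$@" </dev/null >"$_log" 2>&1 &
}

# Example: 'cat' without arguments normally waits for input; started like
# this it terminates at once with an empty log.
#   run_detached /tmp/cat.log cat
#   wait
```

Because the function returns as soon as the command is started, a caller which needs the result has to use wait, which is the price for not letting one blocking start hold up everything else.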

I also identified 57 patches for ZFS which are in 8-stable but not in 7-stable (I do not think they could solve the deadlock, but I do not really know, and now that there is one FS on ZFS, I would like to get as much fixed as possible). Some of them should be merged, some would be nice to merge, and some I do not care much about (but if they are easy to merge, why not…). I already have all revisions and the corresponding commit logs available in an email draft.

Now I just need to write a little bit of text and find some people willing to help (some of the changes need a review to see if they are applicable to 7-stable, and everything should be tested on a scratch box).

Stability problems with 7-stable

On the machine where I host this blog, I have/had some stability problems.

Last week I updated the machine from FreeBSD 7.1-pX to 7.2-p5 (GENERIC kernel in both cases). 5-10 minutes after the reboot into the new version the machine had a deadlock. After some roadblocks (ordering a KVM switch from the hoster, the KVM switch not working through a proxy (during lunchtime at work), a broken video capture of the KVM switch and a replacement on Monday morning to avoid paying the weekend fees), I spent a big part of the night trying to get it stable. I tried disabling SMP, enabling INVARIANTS and WITNESS, changing the scheduler, cutting the software mirror (to rule out a mismatch between the content of the disks after all the hard reboots) and updating to 7-stable.

Unfortunately nothing helped. 🙁

Googling around a little bit (it is an AMD dual-core system with an NVidia MCP61 chipset) led me to a post on the mailing lists from 2008 which talks about an issue with the buffer cache. I do not know if this is still an issue (I have sent an email to kib@ to ask about it), and my scenario is not the same as the one described in the mail, but because of this I decided to switch one of the two UFS mirrors to ZFS.

The first boot into ZFS again caused a reboot after some minutes (I do not know if it was because of a memory-exhaustion panic or because of a deadlock), but as I had not tuned the kernel for ZFS, I am tempted to believe that I should not count that. Now, after tuning the kernel (increasing the kmem_size to 700M, no prefetching, limiting the ARC to 40M), it has been up for nearly 2h (as of this writing… crossing fingers). Before, it was not able to survive more than some minutes with just the jail for the mails up. Now I not only have the mail jail up, but also the jail for the blog (one jail is still disabled, but I will take care of that after this post).
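
For those who want to try the same tuning, here is a sketch of how the three settings could look as /boot/loader.conf lines. The values are the ones from the paragraph above; mapping them to the tunables vm.kmem_size, vfs.zfs.arc_max and vfs.zfs.prefetch_disable is an assumption of this sketch, and the function name zfs_loader_tuning is made up for the illustration.

```shell
#!/bin/sh
# zfs_loader_tuning: print loader.conf lines for the tuning described
# above (kmem_size 700M, ARC limited to 40M, prefetching disabled).
# The tunable names are assumed, the post itself only gives the values.
zfs_loader_tuning() {
    cat <<'EOF'
vm.kmem_size="700M"
vfs.zfs.arc_max="40M"
vfs.zfs.prefetch_disable="1"
EOF
}

# Example: look at the lines first, then append them and reboot.
#   zfs_loader_tuning
#   zfs_loader_tuning >> /boot/loader.conf
```

These values fit a small machine like this one (2GB RAM); they are not a general recommendation.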

I do not know if only increasing the kmem_size would have helped with the problem, but as I was testing a GENERIC kernel + gmirror module in the beginning, I expected that the auto-tuning of this value would be enough for such a simple setup (2GB RAM, 2 disks with 3 partitions each, one partition pair for root, one for swap, one for the jails).

I hope that I have stabilized the system now. It may be the case that I will test some patches if someone comes up with something, so do not be surprised if the blog and email to me are a little bit flaky.

Progress with Networker bugs

Our bug with savepnpc, which causes the post-command to start one minute after the pre-command even if the backup is not done yet, is now hopefully near its resolution. We opened a problem report for this in July; this week we were told that there is a patch available for it. The bad part is that it has been available for 3 weeks and nobody told us. The good part is that we now have it installed on a machine to see if it helps (all zones there seem to be OK, but we have zones where it sometimes works and sometimes fails, so we are not 100% sure, but we hope for the best). We were told that it will be included in Networker

Our other issues are at least no longer stuck in a helpdesk loop; they seem to have reached the developers now.

FreeNAS & Sensors for FreeBSD

This weekend I was told that FreeNAS seems to want to move from FreeBSD to Linux (since then it seems there could be both a Linux and a FreeBSD version). One of the reasons seems to be a missing sensors framework.

As I was the one who committed a port of the OpenBSD sensors framework (produced as part of the Google Summer of Code 2007) to FreeBSD, and had to remove it afterwards because one committer complained very loudly, I was asked what the status of this is.

The short status is: nobody is doing anything about it.

Before I explain the long status, here is a short overview of what this sensors framework is:

  • a kernel API which allows adding sensors
  • an interface for the userland to query the sensor data
  • some basic userland code to show and log the sensor info

The API and the query interface are more or less independent. The userland code was more a logging infrastructure than a real monitoring solution. The reason was that real monitoring solutions already exist (Nagios, snmpd, …) and can be adapted to query the sensors. Ideally a query in userland should be handled by a library instead of directly accessing the sysctl interface; this way the kernel<->userland interface would be abstracted away (and could be replaced as needs arise). This was not done; it was something to be done later (Rome was not built in a day).
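
To make the idea of abstracting the kernel<->userland interface a bit more concrete, here is a minimal sketch in shell. It is not the code from the sensors framework: the names SENSOR_BACKEND, sensor_value and sensor_log are made up, and the sensor name in the example is only a placeholder. The point is that callers never talk to sysctl(8) directly, so the backend can be swapped without touching them.

```shell
#!/bin/sh
# Command used to fetch one raw value. The default asks sysctl(8);
# setting SENSOR_BACKEND to another command replaces the interface
# without changing the callers.
: "${SENSOR_BACKEND:=sysctl -n}"

# sensor_value NAME: print the current value of one sensor.
# SENSOR_BACKEND is left unquoted on purpose so that "sysctl -n" is
# split into the command and its option.
sensor_value() {
    $SENSOR_BACKEND "$1"
}

# sensor_log NAME...: print one timestamped "NAME=VALUE" line per sensor.
sensor_log() {
    for _name in "$@"; do
        printf '%s %s=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
            "$_name" "$(sensor_value "$_name")"
    done
}

# Example with a placeholder sensor name:
#   sensor_log hw.sensors.example0.temp0
```

A real library would do the same in C with a stable API, but the design choice is identical: one small layer which knows about the transport, and everything else built on top of it.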

The userland interface also only cared about dumb sensors (those which you need to query manually to get the information). Smart sensors (those which are able to send events themselves) were not handled in the sense of really sending sensor-triggered events, but the kernel API allowed adding such sensors. The sysctl interface has no way of sending events, but FreeBSD already has an event interface (devd takes care of it). It would not have been a problem to send events via this channel and let a userland library take care of the delivery together with other sensor data in userland.

And now the long status is:

PHK complained loudly about it. First he said he had not looked at it, but he complained that it is not good regardless. After a lot of nagging from me he had a look at it and was not happy about the time stuff in it (short: the FreeBSD timecounter code is better). This was not a problem in my opinion; we could have disabled this part without problems. After I offered exactly that, he complained that the sensors framework uses the sysctl interface instead of an entry in /dev.

At this point in time several userland utilities already used the sysctl framework to query for status data in the kernel. So there was already a precedent for such a use of it. Later some more such uses were added too (e.g. the procstat stuff by core team member Robert Watson).

I saved some of the corresponding mails (to public mailing lists) in an mbox file; read the mess yourself if you want.

The bottom line is: several committers (even some which we could call high-profile committers) told me that they do not see a problem in the use of the sysctl interface. They do not seem to want to say so in public (none of them voiced their opinion in the thread, so do not ask me who those people are). I am not interested in investing more of my spare time into fighting windmills (this is what it looks like to me).

So, if someone is interested in the code, r172631 has it. In the perforce repository you may find some sensors. I think most of it can still be used without many changes.

If someone tries it with a more recent FreeBSD, please drop me a note if it just applies fine, or send a patch (or a URL to it) if it needs some modifications. Who knows, maybe it will be useful for me in a future project.

If there is enough interest from several people, I can even put up a wiki page where those people can coordinate, but that is most probably all I am willing to invest further into this (at least in my unpaid time).