stability problems | Alexander Leidinger

Stability problems solved (hardware problem)

After moving the disks of the 7‑stable system which exhibited stability problems into completely different hardware (it is a rented root-server, not our own hardware), the system has now survived more than a day with the UFS setup, and there is still no trace of problems. Previously it would crash after a few minutes.

The ZFS setup on the changed hardware did have a problem the night before (as always after all my ZFS-related changes on this machine), but on this machine I had changed all locks in ZFS from shared locks to exclusive locks (this extended the uptime from 4 – 6 hours to “until I rebooted the morning after because of hanging processes”), so that change may be the cause. I do not know yet whether we will test the ZFS setup with the pure 7‑stable source we use now (the goal was to get back a stable system, not to play around with unrelated stuff).

It looks like some kind of hardware problem was uncovered by updating from 7.1 to 7.2 (and subsequently to 7‑stable). The new machine has a completely different chipset, a new CPU and RAM and PSU and … so I do not really know what caused this (but the fact that the previous system did not recognize the CPU after it was replaced with a bigger one, together with the observation that only shared locks with a specific usage pattern were affected, makes me suspect missing microcode updates…).

I merged a lot of ZFS patches to 7‑stable

Over the last few weeks I identified 64 patches for ZFS which are in 8‑stable but not in 7‑stable. I had a deeper look at 56 of them, and most of those are now committed to 7‑stable. The ones among those 56 which I did not commit are not applicable to 7‑stable (infrastructure differences between 8 and 7).

Unfortunately this did not solve the stability problems I have on a 7‑stable system.

I also committed a diff-reduction patch (between 8‑stable and 7‑stable) which also fixed some not so harmless mismerges (a memory leak and the same mutex being initialized twice in different places). No idea yet if it helps in my case.

I also want to merge the new ARC reclaim logic from head to 8‑stable and 7‑stable. Maybe I can do this tomorrow.

Currently I am running a test with a kernel in which the shared locks for ZFS are switched to exclusive locks.

Stabilizing 7‑stable…

The 7‑stable system on which I have stability problems after an update from 7.1 to 7.2/7‑stable is now semi-stable.

The watchdog reboots the machine after one minute without a response (currently the system manages to run for 3 – 4 hours), and the jails come up without problems now.

The problem with the jails was that, for example, the mysql-server startup went into the STOP state because TTY input was “requested”. I solved the problem by using /dev/null as input on jail startup. On ‑current I do not see this behavior (I have a 9‑current system with a lot of jails which reboots every X days, and there mysql does not go into the STOP state).
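A minimal sketch of why the /dev/null redirect helps, under the assumption that the stop was the usual job-control reaction to a process trying to read from a terminal it does not own. The function start_service is made up for illustration and is not part of the jail script or of mysql-server:

```shell
#!/bin/sh
# Hypothetical stand-in for a startup script that tries to read from its
# standard input (an assumption about what the mysql-server startup did).
start_service() {
    # read returns non-zero when it hits end-of-file.
    if read -r answer; then
        echo "got input: ${answer}"
    else
        echo "no input available, continuing"
    fi
}

# With stdin redirected from /dev/null the read sees end-of-file at once,
# so the process never waits on (or gets stopped by) a terminal.
result=$(start_service </dev/null)
echo "${result}"
```

With the redirect in place, anything that asks for input gets an immediate end-of-file instead of touching the TTY.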

I also start the jails in the background, so that one blocking jail does not block everything (as is done in ‑current).

To say this with code:

--- /usr/src/etc/rc.d/jail      2009-02-07 15:04:35.000000000 +0100
+++ /etc/rc.d/jail      2009-12-16 17:03:12.000000000 +0100
@@ -556,7 +556,8 @@
 fi
 _tmp_jail=${_tmp_dir}/jail.$$
 eval ${_setfib} jail ${_flags} -i ${_rootdir} ${_hostname} \
-                       \"${_addrl}\" ${_exec_start} > ${_tmp_jail} 2>&1
+                       \"${_addrl}\" ${_exec_start} > ${_tmp_jail} 2>&1 \
+                       </dev/null

 if [ "$?" -eq 0 ] ; then
 _jail_id=$(head -1 ${_tmp_jail})
@@ -623,4 +624,4 @@
 if [ -n "$*" ]; then
 jail_list="$*"
 fi
-run_rc_command "${cmd}"
+run_rc_command "${cmd}" &
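
The effect of the trailing & in the second hunk can be sketched in isolation. This is only an illustration: start_all_jails and the log file are hypothetical stand-ins, and the sleep simulates a jail whose startup blocks for a while.

```shell
#!/bin/sh
log=$(mktemp)

# Hypothetical stand-in for the jail rc command: it blocks for a moment
# before it finishes, like a jail with a slow startup.
start_all_jails() {
    sleep 1
    echo "jails started"
}

# Run the whole command in the background, as the patch does with
# run_rc_command, so the caller gets control back immediately.
start_all_jails >>"${log}" &
echo "rest of boot continues" >>"${log}"

# Only this sketch waits; it does so to make the ordering visible.
wait

first=$(head -n 1 "${log}")
last=$(tail -n 1 "${log}")
cat "${log}"
rm -f "${log}"
```

The caller's line lands in the log first, and the slow command finishes later without having held anything up. The price is that the caller no longer sees the exit status of the backgrounded command directly.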

I also identified 57 patches for ZFS which are in 8‑stable, but not in 7‑stable (I do not think they could solve the deadlock, but I do not really know, and now that there is one FS on ZFS, I would like to get as much fixed as possible). Some of them should be merged, some would be nice to merge, and some I do not care much about (but if they are easy to merge, why not…). I already have all revisions and the corresponding commit logs available in an email-draft.

Now I just need to write a little bit of text and find some people willing to help (some of the changes need to be reviewed for whether they are applicable to 7‑stable, and everything should be tested on a scratch-box).

Stability problems with 7‑stable

On the machine where I host this blog, I have/had some stability problems.

Last week I updated the machine from FreeBSD 7.1‑pX to 7.2‑p5 (GENERIC kernel in both cases). 5 – 10 minutes after the reboot into the new version the machine deadlocked. After some roadblocks (ordering a KVM switch from the hoster, the KVM switch not working through a proxy (during lunchtime at work), a broken video capture on the KVM switch and a replacement on Monday morning to avoid paying the weekend fees), I spent a big part of the night trying to get it stable. I tried disabling SMP, enabling INVARIANTS and WITNESS, changing the scheduler, cutting the software mirror (to rule out a mismatch between the content of the disks after all the hard reboots) and updating to 7‑stable.

Unfortunately nothing helped. 🙁

A little googling (it is an AMD Dual-Core system with an NVidia MCP61 chipset) led me to a post on the mailing lists from 2008 which talks about an issue with the buffer cache. I do not know if this is still an issue (I have sent an email to kib@ to ask about it), and my scenario is not the same as the one described in the mail, but because of this I decided to switch one of the two UFS mirrors to ZFS.

The first boot into the ZFS setup again ended in a reboot after a few minutes (I do not know if it was because of a memory-exhaustion panic or because of a deadlock), but as I had not tuned the kernel for ZFS I am tempted to believe that I should not count that. Now, after tuning the kernel (increasing the kmem_size to 700M, no prefetching, limiting the ARC to 40M), it has been up for nearly 2h (as of this writing… crossing fingers). Before, it was not able to survive more than a few minutes with just the jail for the mails up. Now I not only have the mail-jail up, but also the jail for the blog (one jail is still disabled, but I will take care of that after this post).

I do not know if only increasing the kmem_size would have helped with the problem, but as I was testing a GENERIC kernel + gmirror module in the beginning, I expected that the auto-tuning of this value should have been enough for such a simple setup (2GB RAM, 2 disks with 3 partitions each, one partition pair for root, one for swap, one for the jails).
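
For reference, the tuning described above would typically be expressed as loader tunables. The fragment below is a sketch: the values are the ones mentioned in this post, but the tunable names are an assumption (the usual FreeBSD names for these settings), not a copy of the actual configuration on this machine.

```
# /boot/loader.conf (sketch; tunable names are assumed, values are from the post)
vm.kmem_size="700M"            # kernel memory size raised to 700M
vfs.zfs.prefetch_disable="1"   # no prefetching
vfs.zfs.arc_max="40M"          # limit the ARC to 40M
```

Loader tunables only take effect at boot, so a change here needs a reboot before it can be judged.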

I hope that I have stabilized the system now. I may test some patches in case someone comes up with something, so do not be surprised if the blog and email to me are a little bit flaky.
