On the machine where I host this blog, I have/had some stability problems.
Last week I updated the machine from FreeBSD 7.1-pX to 7.2-p5 (GENERIC kernel in both cases). Five to ten minutes after rebooting into the new version, the machine deadlocked. After some roadblocks (ordering a KVM switch from the hoster, the KVM switch not working through a proxy (during lunchtime at work), a broken video capture on the KVM switch, and a replacement on Monday morning to avoid paying the weekend fees), I spent a big part of the night trying to get it stable. I tried disabling SMP, enabling INVARIANTS and WITNESS, changing the scheduler, breaking the software mirror (to rule out a mismatch between the contents of the disks after all the hard reboots), and updating to 7-stable.
Unfortunately nothing helped.
A bit of googling (it is an AMD dual-core system with an NVidia MCP61 chipset) led me to a 2008 post on the mailing lists that talks about an issue with the buffer cache. I do not know if this is still an issue (I have sent an email to kib@ to ask about it), and my scenario is not the same as the one described in the mail, but because of it I decided to switch one of the two UFS mirrors to ZFS.
The first boot into ZFS again ended in a reboot after a few minutes (I do not know whether it was a memory-exhaustion panic or a deadlock), but as I had not tuned the kernel for ZFS, I am tempted not to count that. Now, after tuning the kernel (increasing kmem_size to 700M, disabling prefetching, limiting the ARC to 40M), it has been up for nearly 2h (as of this writing… fingers crossed). Before, it was not able to survive more than a few minutes with just the mail jail up. Now I not only have the mail jail up, but also the jail for the blog (one jail is still disabled, but I will take care of that after this post).
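For reference, the three tuning changes mentioned above would typically go into /boot/loader.conf. The following is only a sketch: the values are the ones from this post, but the exact tunable names are my assumption about how such settings are usually spelled on FreeBSD 7.x, so check them against your release before copying anything.

```
# /boot/loader.conf (sketch; tunable names assumed, values from this post)

# Raise the kernel memory limit to 700M
vm.kmem_size="700M"

# Disable ZFS file-level prefetching
vfs.zfs.prefetch_disable="1"

# Limit the ARC (the ZFS cache) to 40M
vfs.zfs.arc_max="40M"
```

Loader tunables like these are read at boot time, so a reboot is needed for them to take effect.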
I do not know if only increasing the kmem_size would have helped with the problem, but as I was testing a GENERIC kernel + gmirror module in the beginning, I expected that the auto-tuning of this value should have been enough for such a simple setup (2GB RAM, 2 disks with 3 partitions each, one partition pair for root, one for swap, one for the jails).
I hope that I have stabilized the system now. I may test some patches if someone comes up with something, so do not be surprised if the blog and email to me are a little bit flaky.