Cheap pro­cess mon­it­or­ing (no ad­di­tion­al soft­ware re­quired)

I have an old system (only the hardware is old, it runs -current) which reboots itself from time to time, mostly during the daily periodic(8) run, but also during heavy compiling (portupgrade). There is no obvious reason (no panic) why it does this. It could be a hardware defect, or something else. The machine is not important enough for me to invest much effort into analyzing the problem. The annoying part is that sometimes apache does not start after such a reboot. When this happens, the solution is to log in and start the webserver. If the webserver started every time, nearly nobody would notice the reboot (root gets an email on each reboot via an @reboot crontab entry).

My pragmatic solution (for services started via a good rc.d script which has a working status command) is a crontab entry which periodically checks whether the service is running and restarts it if not. As an example, for apache and an interval of 10 minutes:

*/10 * * * *    /usr/local/etc/rc.d/apache22 status >/dev/null 2>&1 || /usr/local/etc/rc.d/apache22 restart

For the use case of this service/machine, this is enough. If there is a persistent problem with the service, a mail with the restart output arrives each time the job runs; otherwise a mail arrives only after a reboot during which the service did not start.
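If more than one service needs this treatment, the same check can be wrapped in a small script. The following is only a minimal sketch: the function name check_service and the idea of passing the rc.d scripts as arguments are assumptions for illustration, not something the setup above requires.

```shell
#!/bin/sh
# Sketch: restart every rc.d script given as an argument whose
# "status" command reports failure.

# check_service SCRIPT: run "SCRIPT status" quietly; if it fails,
# run "SCRIPT restart" and let its output through, so that cron
# mails it to the owner of the crontab.
check_service() {
    rc_script=$1
    if "$rc_script" status >/dev/null 2>&1; then
        return 0
    fi
    "$rc_script" restart
}

# Check every script named on the command line.
for rc in "$@"; do
    check_service "$rc"
done
```

Assuming the script is saved as /usr/local/sbin/check_services.sh (a hypothetical path), the crontab entry becomes: */10 * * * * /usr/local/sbin/check_services.sh /usr/local/etc/rc.d/apache22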

In­ter­est­ing pro­jects in the GSoC

I counted 18 projects which were given to FreeBSD in this year's GSoC. I have some comments on 3 of them.

Very interesting to me is the project named Collective limits on set of processes (a.k.a. jobs). This looks a bit like the Solaris contract/project IDs. If this project results in something which allows the userland to query which PID belongs to which set, then this allows some nice improvements for start scripts. For example, at work on Solaris each application is a mix of several projects (apache = “name:web” project, tomcat = “name:app” project, Oracle DB = “name:ora” project). Our management framework (written by a co-worker) makes it easy to do something with those projects: a “show” displays the prstat (similar to top) info just for the processes which belong to the project, a “kill” sends a kill signal to all processes of the project, and so on. We could do something similar with our start scripts by declaring a namespace (FreeBSD:base:XXX / FreeBSD:ports:XXX?) and maybe a number space (depending on the implementation) as reserved, and using it to see if processes which belong to a particular script are still running, to kill them, or whatever.

The other two projects I want to comment on here are “Complete libpkg and create new pkg tools” and “Complete Package support in the pkg_install tools and cleanup”. Both projects reference libpkg in their description. I hope the mentors of both projects pay some attention to what is going on in the other project, so that the students do not run into dependencies/clashes between their work.

That I do not mention the other projects does not mean that they are not interesting or the like; it is just that I do not have anything valuable to say about them…

HOWTO ment­or in the GSoC (ini­tial com­mu­nic­a­tion with the stu­dent)

Every ment­or in the GSoC has a dif­fer­ent way of hand­ling stu­dents. Here is what I do.

The student introduced himself to me, as requested by our soc-admins in the initial mail to our students. He looked up which timezone I am in (public info) and told me his timezone (and rough location). That is nice. He also offered different communication channels (basically email and IM).

I confirmed what he looked up, and described what I did in the past GSoCs in which I participated, so that he has an idea whether I am new to the game or not. I told him that quick/short questions are better asked via IM, while long explanations or questions are better handled via email. I also gave him a rough overview of when he can expect quick answers from me and when I am not available.

The following are some questions I asked him, so that I get an impression of what to expect and can plan a bit (some of those may already be answered in the student application, but I prefer to have everything in one place):

  • From when to when do you intend to spend how much time on the GSoC?
  • Any holidays / non-availability planned during the GSoC?
  • Any university stuff (exams/lessons/…) during this time (for Google, the university has higher priority than the GSoC)?
  • Anything else in parallel to the GSoC (some paid work, taking care of ill (grand-)parents, …)?
  • At what level of knowledge do you see yourself regarding computer-science/programming/OS-concepts (relative to other students and relative to the topic)?
  • How do you want to approach the project (where do you want to start, what do you intend to do… just a quick overview… a bit more than saying “I add X”, but not as far as copy&paste of code examples)?

More important than that (IMO) is to give the student an idea of what is expected from him:

  • you have FreeBSD-cur­rent in­stalled (on a real PC or in a vir­tu­al ma­chine)
  • you give me a re­port about the status each week (“did noth­ing” is also a val­id re­port, it gives me the info that you are still alive and did not lose in­terest in the GSoC)
  • if your sched­ule changes in a sig­ni­fic­ant way, give me a little no­ti­fic­a­tion (e.g. “I can not do any­thing next week”)
  • if you spend more than 30 minutes on a problem, prepare an email with the problem description; if this preparation did not solve your problem, send me the mail (if you solve the problem 5 minutes later, no problem, I prefer getting one mail too many over having you stuck with something for an incredible amount of time)

A mentor does not know everything, of course, so the student should be subscribed to hackers@ and current@, and if there is a specific list which matches the project he is working on well, to this mailing list too. This allows the mentor to tell the student to send a mail with his questions to one of those lists, without much preparation needed to receive all the answers.

Another helpful resource is the FreeBSD kernel cross-reference. For some people my doxygen-generated docs of parts of the FreeBSD kernel may be helpful (but unfortunately there is not a lot of doxygen markup within our source code).

I also told him to be prepared that I will ask him to send a reference to a patch of his work to an appropriate mailing list, long enough before the GSoC ends, and that comments from there regarding changes he must or should make are not something bad, but a way to improve the result and/or his skills.

Ment­or­ing again in the GSoC

It seems that I will actively mentor again in this Google Summer of Code (as opposed to just reviewing the submissions from students and/or acting as a fall-back mentor).

The pro­ject I will ment­or is the “Make op­tion­al ker­nel sub­sys­tems re­gister them­selves via sy­sctl”-one from the FreeBSD ideas page.

The student already got in contact with me and it looks like he is motivated (he is already subscribed to several FreeBSD mailing lists, which is not a requirement we have in our GSoC docs).

One-​Time-​Passwords for Horde/​IMP?

I am looking for a way to use one-time passwords for Horde/IMP on FreeBSD. I do not want to use PAM (local users on the machine). Currently I use authentication via IMAP4 (the IMAP4 server and postfix are linked via MySQL, to have the same PW for sending and receiving), and I expect that not all users of Horde/IMP will use OTP if available, so the problem is not that easy. I can imagine a solution which tries to authenticate via OTP first, and if that succeeds, fetches a password for the login to the IMAP4 server. If the OTP auth fails, it could try the entered password for the login to the IMAP4 server. Migrating existing users to a new solution can be done by telling them to enter the password from the machine of the person doing the migration. The solution needs to log in to the IMAP4 server automatically; entering a password for the IMAP4 server after the OTP login to Horde is not an option.
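The login flow imagined above could be sketched roughly as follows. This is not an existing Horde/IMP feature: all function names are hypothetical, and the three backend functions are stubs with hard-coded demo values which only exist so that the decision logic can be run.

```shell
#!/bin/sh
# Sketch of the two-stage login: OTP first, entered password second.
# The three backend functions are hypothetical stubs with demo values.

# otp_verify USER CODE: succeed if CODE is a valid one-time password.
otp_verify() { [ "$2" = "otp-123456" ]; }

# imap_secret_for USER: print the IMAP password stored for an OTP user.
imap_secret_for() { printf '%s\n' "stored-imap-secret"; }

# imap_login USER PASSWORD: succeed if the IMAP4 server accepts the login.
imap_login() {
    case "$2" in
        stored-imap-secret|legacy-password) return 0 ;;
        *) return 1 ;;
    esac
}

# webmail_login USER ENTERED: print which path authenticated the user
# ("otp" or "password"); return non-zero if both paths fail.
webmail_login() {
    user=$1
    entered=$2
    if otp_verify "$user" "$entered"; then
        # OTP accepted: log in to IMAP with the stored secret.
        imap_login "$user" "$(imap_secret_for "$user")" && echo otp
    elif imap_login "$user" "$entered"; then
        # No valid OTP: treat the input as the normal IMAP password.
        echo password
    else
        return 1
    fi
}
```

With the stub values, "webmail_login alice otp-123456" takes the OTP path, while "webmail_login bob legacy-password" falls back to the normal password, which is the behaviour needed for users who do not switch to OTP.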

Oh, yes, just sending the passwords over SSL is not the answer (SSL is already the only way to log in there). The goals are to have

  • an easy to re­mem­ber pass­word for an OTP app on the mo­bile to gen­er­ate the real pass­word
  • the pass­word ex­pire fast, so that a stolen pass­word does not cause much harm
  • not the same login-​password for dif­fer­ent ser­vices (mail-​pw != jabber-​pw != user-​pw)

One-​Time-​Passwords for XMPP/​Jabber?

I am looking for a way to use one-time passwords for jabber/XMPP (ejabberd) on FreeBSD. I do not want to use PAM (local users on the machine). Currently I use the internal authentication, and I expect that not all users of the jabber server will use OTP if available, so the problem is not that easy (migrating existing users to a new solution can be done by changing the password myself and then telling them to change their password, but there needs to be a way to let them change the non-OTP password).

I as­sume that OTP is not fore­seen in the XMPP pro­tocol, so where could I ask to have some­thing like that con­sidered as an ex­ten­sion (if such a place ex­ists at all)?

Oh, yes, just sending the passwords over SSL is not the answer (SSL is already the only way to log in there). The goals are to have

  • an easy to re­mem­ber pass­word for an OTP app on the mo­bile to gen­er­ate the real pass­word
  • the pass­word ex­pire fast, so that a stolen pass­word does not cause much harm
  • not the same login-​password for dif­fer­ent ser­vices (mail-​pw != jabber-​pw != user-​pw)

ARC (ad­apt­ive re­place­ment cache) ex­plained

At work we have the situ­ation of a slow ap­plic­a­tion. The vendor of the cus­tom ap­plic­a­tion in­sists that the ZFS (Sol­ar­is 10u8) and the Or­acle DB are badly tuned for the ap­plic­a­tion. Part of their tun­ing is to lim­it the ARC to 1 GB (our max size is 24 GB on this ma­chine). One prob­lem we see is that there are many write op­er­a­tions (roun­ded val­ues: 1k ops for up to 100 MB) and the DB is com­plain­ing that the log­writer is not able to write out the data fast enough. At the same time our data­base ad­mins see a lot of com­mits and/​or roll­backs so that the archive log grows very fast to 1.5 GB. The funny thing is… the per­form­ance tests are sup­posed to only cov­er SE­LECTs and small UP­DATEs.

I proposed to reduce zfs_txg_timeout from the default value of 30 seconds to a few seconds (and as no reboot is needed, unlike for the max ARC size, this can be done quickly instead of waiting some minutes for the boot-checks of the M5000). The first try was to reduce it to 5 seconds, and it improved the situation. The DB still complained about not being able to write out the logs fast enough, but not as often as before. To make the vendor happy we reduced the max ARC size and tested again. At first we did not see any complaints from the DB anymore, which looked strange to me, because my understanding of the ARC (and the description of the max size setting in the ZFS Evil Tuning Guide) suggests that limiting it should not have this effect. But the machine was also rebooted for this, so there could also be another explanation.
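For illustration, this is a sketch of how the two settings are typically made persistent on Solaris 10, following the conventions of the ZFS Evil Tuning Guide. The values are the ones mentioned above (1 GB ARC limit, 5 seconds txg timeout); the exact syntax should be checked against the guide for the release in use.

```
* /etc/system fragment (takes effect after a reboot)
* Limit the ARC to 1 GB (0x40000000 bytes)
set zfs:zfs_arc_max = 0x40000000
* Sync a transaction group at least every 5 seconds
set zfs:zfs_txg_timeout = 5
```

The txg timeout can also be changed on the running kernel, which is what makes it quick to test: echo zfs_txg_timeout/W0t5 | mdb -kw (the current value can be read with echo zfs_txg_timeout/D | mdb -k).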

Luck­ily we found out that our test­ing in­fra­struc­ture had a prob­lem so that only a frac­tion of the per­form­ance test was per­formed. This morn­ing the people re­spons­ible for that made some changes and now the DB is com­plain­ing again.

This is what I expected. To make sure I fully understand the ARC, I had a look at the theory behind it from the IBM research center (update: PDF link). There are some papers which explain how to extend a cache which uses the LRU replacement policy into an ARC with a few lines of code. It looks like it would be worthwhile to check in which places FreeBSD uses an LRU policy, and to test whether an ARC would improve the cache hit rate there. From reading the paper it looks like there are a lot of places where this should be the case. The authors also provide two adaptive extensions to the CLOCK algorithm (used in the VM subsystem of various OSes), which indicates that such an approach could be beneficial for a VM system. I already contacted Alan (the FreeBSD one) and asked if he knows about it and if it could be beneficial for FreeBSD.