Interesting projects in the GSoC

I counted 18 projects assigned to FreeBSD in this year's GSoC. For 3 of them I have some comments.

Very interesting to me is the project named Collective limits on set of processes (a.k.a. jobs). This looks a bit like the Solaris contract/project IDs. If this project results in something which allows userland to query which PID belongs to which set, then this allows some nice improvements for start scripts. For example, at work on Solaris each application is a mix of several projects (apache = “name:web” project, tomcat = “name:app” project, Oracle DB = “name:ora” project). Our management framework (written by a co-worker) makes it easy to do things with those projects: a “show” displays the prstat (similar to top) info just for the processes which belong to the project, a “kill” sends a kill signal to all processes of the project, and so on. We could do something similar with our start scripts by declaring a namespace (FreeBSD:base:XXX / FreeBSD:ports:XXX?) and maybe a number space (depending on the implementation) as reserved, and use it to see if processes which belong to a particular script are still running, or kill them, or whatever.
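
To illustrate the idea, here is a rough Python sketch of the “show”/“kill” part. This is not our actual framework; it assumes a Solaris ps(1) which understands the “project” output field, and “name:web” is just the example project name from above.

```python
#!/usr/bin/env python
# Rough sketch of the "show"/"kill" idea against Solaris projects.
# Assumption: Solaris ps(1) with the "project" output field is available.
import os
import signal
import subprocess

def pids_of_project(project):
    """Return the PIDs of all processes which belong to the given project."""
    out = subprocess.check_output(["ps", "-e", "-o", "pid,project"])
    pids = []
    for line in out.decode().splitlines()[1:]:   # skip the header line
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        pid, proj = parts
        if proj.strip() == project:
            pids.append(int(pid))
    return pids

def show(project):
    """'show': list the processes of a project (prstat gives the live view)."""
    for pid in pids_of_project(project):
        print(pid)

def kill(project, sig=signal.SIGTERM):
    """'kill': send a signal to every process of the project."""
    for pid in pids_of_project(project):
        os.kill(pid, sig)

if __name__ == "__main__":
    show("name:web")     # example project name, not a real one on your system
```

A FreeBSD jobs implementation with a similar userland query interface would let start scripts do the same kind of thing with the reserved namespace mentioned above.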

The other two projects I want to comment upon here are Complete libpkg and create new pkg tools and Complete Package support in the pkg_install tools and cleanup. Both projects reference libpkg in their description. I hope the mentors of both projects pay some attention to what is going on in the other project, so as not to cause dependencies/clashes between the students.

The fact that I do not mention other projects does not mean they are not interesting; it is just that I do not have anything valuable to say about them…

ARC (adaptive replacement cache) explained

At work we have a slow application. The vendor of this custom application insists that ZFS (Solaris 10u8) and the Oracle DB are badly tuned for the application. Part of their tuning is to limit the ARC to 1 GB (our max size is 24 GB on this machine). One problem we see is that there are many write operations (rounded values: 1k ops for up to 100 MB) and the DB is complaining that the logwriter is not able to write out the data fast enough. At the same time our database admins see a lot of commits and/or rollbacks, so the archive log grows very fast to 1.5 GB. The funny thing is… the performance tests are supposed to cover only SELECTs and small UPDATEs.

I proposed to reduce zfs_txg_timeout from the default value of 30 seconds to a few seconds (and as no reboot is needed, unlike for the max ARC size, this can be done quickly instead of waiting some minutes for the boot checks of the M5000). The first try was to reduce it to 5 seconds, and it improved the situation. The DB still complained about not being able to write out the logs fast enough, but not as often as before. To make the vendor happy we reduced the max ARC size and tested again. At first we did not see any complaints from the DB anymore, which looked strange to me because my understanding of the ARC (and the description of the max size setting in the ZFS Evil Tuning Guide) suggests that we should not see the behavior we saw; but the machine was also rebooted for this, so there could be another explanation.
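
For the record, the runtime change itself is a one-liner with mdb(1); the following is just a small Python wrapper around it as an illustration. The tunable name is the one from the ZFS Evil Tuning Guide, “/D” prints a 32-bit value in decimal and “/W 0t<n>” writes one; it needs root and is a sketch of the idea, not a supported interface.

```python
#!/usr/bin/env python
# Hedged sketch: change zfs_txg_timeout on a live Solaris kernel via mdb(1).
import subprocess

def read_txg_timeout():
    # Print the current value of the kernel variable in decimal.
    out = subprocess.check_output(["mdb", "-k"], input=b"zfs_txg_timeout/D\n")
    return out.decode().strip()

def write_txg_timeout(seconds):
    # -kw opens the running kernel writable; 0t marks a decimal value.
    cmd = ("zfs_txg_timeout/W 0t%d\n" % seconds).encode()
    subprocess.check_output(["mdb", "-kw"], input=cmd)

if __name__ == "__main__":
    print(read_txg_timeout())
    write_txg_timeout(5)    # the value from our first try
    print(read_txg_timeout())
```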

Luckily we found out that our testing infrastructure had a problem so that only a fraction of the performance test was performed. This morning the people responsible for that made some changes and now the DB is complaining again.

This is what I expected. To make sure I fully understand the ARC, I had a look at the theory behind it from the IBM research center (update: PDF link). There are some papers which explain how to extend a cache which uses the LRU replacement policy into an ARC with a few lines of code. It looks like it would be worthwhile to check at which places in FreeBSD an LRU policy is used and to test whether an ARC would improve the cache hit rate there. From reading the paper it looks like there are a lot of places where this should be the case. The authors also provide two adaptive extensions to the CLOCK algorithm (used in the VM subsystem of various OSes) which indicate that such an approach could be beneficial for a VM system. I already contacted Alan (the FreeBSD one) and asked if he knows about it and if it could be beneficial for FreeBSD.
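
To get a feeling for how small that extension is, here is a minimal Python sketch of the ARC replacement policy, following the pseudocode from the Megiddo/Modha paper (keys only, no data blocks; this is not the ZFS code and not meant as a FreeBSD patch):

```python
from collections import OrderedDict

class ARC:
    """Minimal ARC sketch: T1/T2 hold cached keys (recent/frequent),
    B1/B2 are the ghost lists of recently evicted keys, p is the
    adaptive target size of T1."""

    def __init__(self, c):
        self.c = c               # cache size
        self.p = 0               # target size of T1
        self.t1 = OrderedDict()  # recent, cached
        self.t2 = OrderedDict()  # frequent, cached
        self.b1 = OrderedDict()  # ghost of T1
        self.b2 = OrderedDict()  # ghost of T2

    def _replace(self, in_b2):
        # Move the LRU entry of T1 or T2 to the top of its ghost list.
        if self.t1 and (len(self.t1) > self.p or
                        (in_b2 and len(self.t1) == self.p)):
            old, _ = self.t1.popitem(last=False)
            self.b1[old] = None
        else:
            old, _ = self.t2.popitem(last=False)
            self.b2[old] = None

    def access(self, x):
        """Reference key x; return True on a cache hit."""
        if x in self.t1 or x in self.t2:       # Case I: hit, promote to MRU of T2
            self.t1.pop(x, None)
            self.t2.pop(x, None)
            self.t2[x] = None
            return True
        if x in self.b1:                       # Case II: ghost hit in B1, grow p
            self.p = min(self.c, self.p + max(len(self.b2) // len(self.b1), 1))
            self._replace(False)
            del self.b1[x]
            self.t2[x] = None
            return False
        if x in self.b2:                       # Case III: ghost hit in B2, shrink p
            self.p = max(0, self.p - max(len(self.b1) // len(self.b2), 1))
            self._replace(True)
            del self.b2[x]
            self.t2[x] = None
            return False
        # Case IV: complete miss
        l1 = len(self.t1) + len(self.b1)
        total = l1 + len(self.t2) + len(self.b2)
        if l1 == self.c:
            if len(self.t1) < self.c:
                self.b1.popitem(last=False)
                self._replace(False)
            else:
                self.t1.popitem(last=False)
        elif total >= self.c:
            if total >= 2 * self.c:
                self.b2.popitem(last=False)
            self._replace(False)
        self.t1[x] = None                      # insert x at MRU of T1
        return False
```

Feeding it an access trace and comparing the hit rate against a plain LRU of the same size is a nice way to see the adaptation at work; the ghost lists only store keys, so the extra bookkeeping is cheap compared to simply doubling the cache.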

Showing off some numbers…

At work we have some performance problems.

One application (not off-the-shelf software) is not performing well. The problem is that the design of the application is far from good (auto-commit is used, and because of this the Oracle DB is doing far too many writes for what the application is supposed to do). While helping our DBAs with their performance analysis (the vendor of the application is telling us our hardware is not fast enough, and I had to provide some numbers to show that this is not the case and that they need to improve the software, as it does not comply with the performance requirements they got before developing the application), I noticed that the filesystem where the DB and the application are located (a ZFS, if someone is interested) is sometimes doing 1,200 write IO operations per second (to write about 100 MB). Yeah, that is a lot of IOps our SAN is able to do! Unfortunately too expensive to buy for use at home. 🙁

Another application (nagios 3.0) was generating a lot of major faults (caused by a lot of fork()s for the checks). It is a SunFire V890, and the highest number of MF per second I have seen on this machine was about 27,000. It never went below 10,000. On average it was maybe somewhere between 15,000 and 20,000. My Solaris desktop (an Ultra 20) generates maybe several hundred MF if a lot is going on (most of the time it does not generate much). Nobody can say the V890 is not used… 🙂 Oh, yes, I suggested enabling the nagios config setting for large sites; now the major faults are around 0–10,000 and the machine is not as stressed anymore. The next step is probably to have a look at the ancient probes (migrated from the Big Brother setup which was there several years before) and reduce the number of forks they do.