ARC (adaptive replacement cache) explained

At work we have the situation of a slow application. The vendor of the custom application insists that ZFS (Solaris 10u8) and the Oracle DB are badly tuned for the application. Part of their tuning is to limit the ARC to 1 GB (our max size is 24 GB on this machine). One problem we see is that there are many write operations (rounded values: 1k ops for up to 100 MB), and the DB complains that the logwriter is not able to write out the data fast enough. At the same time our database admins see a lot of commits and/or rollbacks, so the archive log grows very fast to 1.5 GB. The funny thing is… the performance tests are supposed to cover only SELECTs and small UPDATEs.

I proposed to reduce zfs_txg_timeout from the default value of 30 to a few seconds (and as no reboot is needed, unlike for the max ARC size, this can be done quickly instead of waiting some minutes for the boot checks of the M5000). The first try was to reduce it to 5 seconds, and it improved the situation. The DB still complained about not being able to write out the logs fast enough, but not as often as before. To make the vendor happy we also reduced the max ARC size and tested again. At first we did not see any complaints from the DB anymore, which looked strange to me, because my understanding of the ARC (and the description in the ZFS Evil Tuning Guide regarding the max size setting) suggests that we should not see the behavior we have seen, but the machine was also rebooted for this, so there could be another explanation.
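
For reference, and heavily hedged because you should check the ZFS Evil Tuning Guide yourself before copying anything: on Solaris 10 the runtime change is typically done with mdb, and the persistent settings go into /etc/system, roughly like this (0x40000000 being the 1 GB ARC cap the vendor asked for):

```
# Runtime change of the txg timeout to 5 seconds, no reboot needed:
echo "zfs_txg_timeout/W0t5" | mdb -kw

# Persistent settings in /etc/system (the ARC cap only takes effect
# after a reboot, which is why that test needed the M5000 boot checks):
set zfs:zfs_txg_timeout = 5
set zfs:zfs_arc_max = 0x40000000
```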

Luckily we found out that our testing infrastructure had a problem, so only a fraction of the performance test was actually performed. This morning the people responsible for it made some changes, and now the DB is complaining again.

This is what I expected. To make sure I fully understand the ARC, I had a look at the theory behind it at the IBM research center (update: PDF link). There are some papers which explain how to extend a cache that uses the LRU replacement policy into an ARC with a few lines of code. It looks like it would be worthwhile to have a look at the places in FreeBSD where an LRU policy is used, to test whether an ARC would improve the cache hit rate. From reading the paper it looks like there are a lot of places where this should be the case. The authors also provide two adaptive extensions to the CLOCK algorithm (used in various OSes in the VM subsystem), which indicates that such an approach could be beneficial for a VM system. I already contacted Alan (the FreeBSD one) and asked if he knows about it and whether it could be beneficial for FreeBSD.
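
To make the idea more concrete, here is a minimal sketch of the ARC replacement policy as I understand it from the Megiddo/Modha paper. This is illustrative Python, not the Solaris/ZFS implementation; the load callback is just a placeholder for whatever fetches the data on a miss.

```python
# Minimal sketch of the ARC replacement policy from the Megiddo/Modha paper
# (illustrative only, NOT the Solaris/ZFS implementation).
from collections import OrderedDict

class ARC:
    def __init__(self, c):
        self.c = c                      # cache size in entries
        self.p = 0                      # adaptive target size of T1
        self.t1 = OrderedDict()         # recently used, seen once (cached)
        self.t2 = OrderedDict()         # frequently used, seen twice or more (cached)
        self.b1 = OrderedDict()         # ghost list: keys recently evicted from T1
        self.b2 = OrderedDict()         # ghost list: keys recently evicted from T2

    def _replace(self, key):
        # Evict from T1 or T2, depending on the adaptive target p.
        if self.t1 and (len(self.t1) > self.p or
                        (key in self.b2 and len(self.t1) == self.p)):
            old, _ = self.t1.popitem(last=False)    # LRU entry of T1
            self.b1[old] = None                     # remember it in B1
        else:
            old, _ = self.t2.popitem(last=False)    # LRU entry of T2
            self.b2[old] = None                     # remember it in B2

    def get(self, key, load):
        # Case I: hit in T1 or T2 -> promote to MRU position of T2.
        if key in self.t1:
            self.t2[key] = self.t1.pop(key)
            return self.t2[key]
        if key in self.t2:
            self.t2.move_to_end(key)
            return self.t2[key]

        # Case II: ghost hit in B1 -> recency is winning, grow T1's target.
        if key in self.b1:
            self.p = min(self.c, self.p + max(len(self.b2) // len(self.b1), 1))
            self._replace(key)
            del self.b1[key]
            self.t2[key] = load(key)
            return self.t2[key]

        # Case III: ghost hit in B2 -> frequency is winning, shrink T1's target.
        if key in self.b2:
            self.p = max(0, self.p - max(len(self.b1) // len(self.b2), 1))
            self._replace(key)
            del self.b2[key]
            self.t2[key] = load(key)
            return self.t2[key]

        # Case IV: complete miss.
        if len(self.t1) + len(self.b1) == self.c:
            if len(self.t1) < self.c:
                self.b1.popitem(last=False)         # drop LRU ghost of B1
                self._replace(key)
            else:
                self.t1.popitem(last=False)         # B1 is empty: evict LRU of T1
        else:                                       # |T1| + |B1| < c
            total = len(self.t1) + len(self.t2) + len(self.b1) + len(self.b2)
            if total >= self.c:
                if total == 2 * self.c:
                    self.b2.popitem(last=False)     # keep the directory at 2c keys
                self._replace(key)

        self.t1[key] = load(key)                    # new entries start in T1
        return self.t1[key]
```

The whole adaptivity sits in the parameter p: a hit in the ghost list B1 means a pure LRU would have kept that page, so the recency side T1 gets a bigger share of the cache; a hit in B2 does the opposite. That is the self-tuning behavior a plain LRU (or CLOCK) cache cannot provide.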

Showing off some numbers…

At work we have some performance problems.

One application (not off-the-shelf software) is not performing well. The problem is that the design of the application is far from good (auto-commit is used, and because of this the Oracle DB does far more writes than the application should need). While helping our DBAs with their performance analysis (the vendor of the application claims our hardware is not fast enough, and I had to provide some numbers to show that this is not the case and that they need to improve the software, as it does not comply with the performance requirements they got before developing the application), I noticed that the filesystem where the DB and the application are located (a ZFS, if someone is interested) sometimes does 1,200 (write) IO operations per second (to write about 100 MB). Yeah, that is a lot of IOPS our SAN is able to do! Unfortunately too expensive to buy for use at home. 🙁

Another application (nagios 3.0) was generating a lot of major faults (caused by a lot of fork()s for the checks). It is a SunFire V890, and the highest number of MF per second I have seen on this machine was about 27,000. It never went below 10,000; on average maybe somewhere between 15,000 and 20,000. My Solaris desktop (an Ultra 20) generates maybe several hundred MF if a lot is going on (most of the time it does not generate much). Nobody can say the V890 is not used… 🙂 Oh, yes, I suggested enabling the nagios config setting for large sites, and now the major faults are around 0–10,000 and the machine is not as stressed anymore. The next step is probably to have a look at the ancient probes (migrated from the Big Brother setup which was there several years before) and reduce the number of forks they do.
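
In case someone wonders which setting I mean: quoted from memory (so please check the Nagios 3 documentation before relying on it), it is the large-installation tweaks option in nagios.cfg:

```
# nagios.cfg -- the "large sites" setting mentioned above (name from memory)
use_large_installation_tweaks=1
```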

Firefox 3.6, finally delivering sane proxy handling

At work we have to use a proxy which requires authorization. With previous versions (Firefox 3.0.x and 3.5.y, for each valid x and y) I had the problem that each tab asked for the master password when starting Firefox, in order to fill in the proxy-auth data (shortcut: fill in only the first request, and for all others just hit return/OK). So for each tab I had to do something for the master password, and after that, for each tab, I also had to confirm the proxy-auth dialog.

Very annoying! Oh, I should maybe mention that as of this writing I have 31 tabs open. Sometimes there are more, sometimes there are less.

Now with Firefox 3.6 this is not the case anymore. Yeah! Great! Finally the master password prompt only once, then the proxy-auth prompt only once, and then all tabs proceed.

It took a long time since my first report about this, but now it is finally there. This is the best improvement in 3.6 for me.

Progress with Networker bugs

Our bug with savepnpc, which causes the post-command to start one minute after the pre-command even if the backup is not done yet, is now hopefully near resolution. We opened a problem report for this in July; this week we were told that a patch is available for it. The bad part is that it has been available for 3 weeks and nobody told us. The good part is that we have it installed on a machine now to see if it helps (all zones there seem to be OK, but we have zones where it sometimes works and sometimes fails, so we are not 100% sure, but we hope for the best). We were told that it will be included in Networker 7.5.1.8.

Our other issues are at least not stuck in a helpdesk loop anymore; they seem to have reached the developers now.

ZFS & power failure: stable

At the weekend there was a power failure at our disaster-recovery site. As everything should be connected to the UPS, this should not have had an impact… unfortunately the guys responsible for the cabling seem not to have provided enough power connections from the UPS. Result: one of our storage systems (all volumes in several RAID5 virtual disks) for the test systems lost power, and 10 harddisks switched into failed state when the power was stable again (I was told there were several small power failures that day). After telling the software to have a look at the drives again, all physical disks were accepted.

All volumes on one of the virtual disks were damaged beyond repair (actually, the virtual disk itself was damaged), and we had to recover from backup.

All ZFS-based mountpoints on the good virtual disks did not show any bad behavior (zpool clear + zpool scrub for those which showed checksum errors, to make us feel better). For the UFS-based ones… some caused a panic after reboot and we had to run fsck on them before trying a second boot.
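
For completeness, the ZFS side of the cleanup boiled down to the following (the pool name is made up for this example; the real pools are named after the test systems):

```
zpool status -xv        # show only unhealthy pools, with error details
zpool clear testpool    # reset the error counters on the affected pool
zpool scrub testpool    # re-read and verify everything on the pool
zpool status testpool   # check the scrub result afterwards
```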

We spent a lot more time getting UFS back online than getting ZFS back online. After this experience it looks like our future Solaris 10u8 installs will have root on ZFS (our workstations are already set up like this, but our servers are still on Solaris 10u6).
