ARC (adaptive replacement cache) explained

At work we are dealing with a slow application. The vendor of the custom application insists that ZFS (Solaris 10u8) and the Oracle DB are badly tuned for the application. Part of their tuning is to limit the ARC to 1 GB (our max size is 24 GB on this machine). One problem we see is that there are many write operations (rounded values: 1k ops for up to 100 MB) and the DB complains that the logwriter is not able to write out the data fast enough. At the same time our database admins see a lot of commits and/or rollbacks, so the archive log grows very fast to 1.5 GB. The funny thing is… the performance tests are supposed to cover only SELECTs and small UPDATEs.

I proposed to reduce zfs_txg_timeout from its default value of 30 seconds to a few seconds (and since no reboot is needed, unlike for the max ARC size, this can be done quickly instead of waiting several minutes for the boot checks of the M5000). The first try was to reduce it to 5 seconds, and it improved the situation. The DB still complained about not being able to write out the logs fast enough, but not as often as before. To make the vendor happy we also reduced the max ARC size and tested again. At first we did not see any complaints from the DB anymore, which looked strange to me, because my understanding of the ARC (and the description of the max size setting in the ZFS Evil Tuning Guide) suggests that we should not see the behavior we saw. But the machine was also rebooted for this test, so there could be another explanation.
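For reference, this is roughly how the two tunables are set on Solaris 10. Treat it as a sketch of the usual conventions (live change via mdb, persistent change via /etc/system), and double-check the syntax on your own system before using it:

```sh
# Change zfs_txg_timeout to 5 seconds on the live kernel, no reboot
# needed ("0t" is mdb's prefix for a decimal value):
echo "zfs_txg_timeout/W0t5" | mdb -kw

# Persistent settings go into /etc/system; the ARC limit only takes
# effect after a reboot (0x40000000 bytes = 1 GB):
set zfs:zfs_txg_timeout = 5
set zfs:zfs_arc_max = 0x40000000
```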

Luckily we found out that our testing infrastructure had a problem, so only a fraction of the performance test was actually performed. This morning the people responsible for it made some changes, and now the DB is complaining again.

This is what I expected. To make sure I fully understand the ARC, I had a look at the theory behind it at the IBM research center (update: PDF link). There are some papers which explain how to extend a cache that uses the LRU replacement policy into an ARC with a few lines of code. It looks like it would be worthwhile to check at which places in FreeBSD an LRU policy is used, to see whether an ARC would improve the cache hit rate there. From reading the paper it looks like there are a lot of places where this should be the case. The authors also provide two adaptive extensions to the CLOCK algorithm (used in various OSes in the VM subsystem) which indicate that such an approach could be beneficial for a VM system. I already contacted Alan (the FreeBSD one) and asked if he knows about it and whether it could be beneficial for FreeBSD.
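To illustrate the idea from the paper (Megiddo and Modha's ARC, if I remember correctly), here is a small Python sketch of the algorithm: two LRU lists T1 (pages seen once) and T2 (pages seen at least twice), plus two ghost lists B1/B2 which only remember recently evicted keys, and a target size p which adapts on ghost hits. This is my own toy reading of the paper (it approximates the ratio terms with integer division), not the ZFS or any production implementation:

```python
from collections import OrderedDict


class ARC:
    """Sketch of an adaptive replacement cache.

    T1/T2 hold the cached keys (recency vs. frequency side); B1/B2 are
    "ghost" lists that remember evicted keys but hold no data.  The
    parameter p is the adaptive target size for T1.
    """

    def __init__(self, c):
        self.c = c          # cache capacity: len(t1) + len(t2) <= c
        self.p = 0          # target size for T1
        self.t1, self.t2 = OrderedDict(), OrderedDict()
        self.b1, self.b2 = OrderedDict(), OrderedDict()

    def _replace(self, in_b2):
        # Evict from T1 into ghost list B1, or from T2 into B2,
        # depending on which side exceeds its target size p.
        if self.t1 and (len(self.t1) > self.p or
                        (in_b2 and len(self.t1) == self.p)):
            old, _ = self.t1.popitem(last=False)
            self.b1[old] = None
        else:
            old, _ = self.t2.popitem(last=False)
            self.b2[old] = None

    def request(self, key):
        """Access `key`; return True on a cache hit."""
        if key in self.t1:              # second access: promote to T2
            del self.t1[key]
            self.t2[key] = None
            return True
        if key in self.t2:              # hit in T2: refresh MRU position
            self.t2.move_to_end(key)
            return True
        if key in self.b1:              # ghost hit: grow the recency side
            self.p = min(self.c, self.p + max(len(self.b2) // len(self.b1), 1))
            self._replace(False)
            del self.b1[key]
            self.t2[key] = None
            return False
        if key in self.b2:              # ghost hit: grow the frequency side
            self.p = max(0, self.p - max(len(self.b1) // len(self.b2), 1))
            self._replace(True)
            del self.b2[key]
            self.t2[key] = None
            return False
        # Complete miss: make room, then insert at the MRU end of T1.
        if len(self.t1) + len(self.b1) == self.c:
            if len(self.t1) < self.c:
                self.b1.popitem(last=False)
                self._replace(False)
            else:
                self.t1.popitem(last=False)   # evict without a ghost entry
        elif len(self.t1) + len(self.b1) < self.c:
            total = len(self.t1) + len(self.t2) + len(self.b1) + len(self.b2)
            if total >= self.c:
                if total == 2 * self.c:
                    self.b2.popitem(last=False)
                self._replace(False)
        self.t1[key] = None
        return False
```

The whole adaptation trick is in the ghost hits: a hit in B1 or B2 means "this page would still be cached if that side were bigger", so p shifts toward that side — which is why the paper can describe it as a small extension to an existing LRU cache.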


I merged a lot of ZFS patches to 7-stable

During the last weeks I identified 64 patches for ZFS which are in 8-stable but not in 7-stable. For 56 of them I had a deeper look, and most of them are committed now to 7-stable. The ones of those 56 which I did not commit are not applicable to 7-stable (infrastructure differences between 8 and 7).

Unfortunately this did not solve the stability problems I have on a 7-stable system.

I also committed a diff-reduction patch (between 8-stable and 7-stable) which also fixed some not-so-harmless mismerges (a memory leak, and the same mutex initialized twice in different places). No idea yet if it helps in my case.

I also want to merge the new ARC reclaim logic from head to 8-stable and 7-stable. Maybe I can do this tomorrow.

Currently I am running a test with a kernel where the shared locks for ZFS are switched to exclusive locks.
