ZFS and NFS / on-disk-cache

In the FreeBSD mailing lists I stumbled over a post which refers to a blog post describing why ZFS seems to be slow (on Solaris).

In short: ZFS guarantees that the NFS client does not experience silent data corruption (an NFS server crash losing data which the client believes is already on disk). A recommendation is to enable the disk write cache for disks which are used exclusively by ZFS, as ZFS (unlike UFS) is aware of disk caches. This brings the performance up to what UFS delivers in the NFS case.

There is no in-depth description of what it means that ZFS is aware of disk caches, but I think this refers to the fact that ZFS sends a flush command to the disk at the right moments. Leaving aside the fact that there are disks out there which lie to you about this (they report the flush command as finished when it is not), this would mean that this is supported in FreeBSD too.
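For illustration, checking and enabling the disk write cache on FreeBSD looks roughly like this (a sketch; the device name da0 is just an example, and the tunable/commands are the ones I remember, so double-check against your FreeBSD version):

    # ata(4) disks: the write cache is controlled via a loader tunable
    # (put into /boot/loader.conf, takes effect on the next boot)
    hw.ata.wc=1

    # CAM-attached (SCSI/SAS) disks: inspect the caching mode page (8),
    # the WCE field is the write-cache-enable bit; -e opens it for editing
    camcontrol modepage da0 -m 8
    camcontrol modepage da0 -m 8 -e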

So everyone who is currently disabling the ZIL to get better NFS performance (and accepting silent data corruption on the client side): move your zpool to dedicated, honest disks (no other real filesystem than ZFS; swap and dump devices are OK) and enable the disk caches instead of disabling the ZIL.
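To be explicit about what I mean (a sketch from memory; the exact tunable name may differ between FreeBSD versions): the ZIL is typically switched off via a loader tunable, and the point here is to not do that.

    # /boot/loader.conf
    # what people do to get NFS performance back (do NOT do this, it trades
    # away the correctness guarantee for the NFS client):
    #vfs.zfs.zil_disable=1
    # instead: keep the ZIL enabled and switch on the write cache of the
    # disks which are used exclusively by ZFS (see the camcontrol example above)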

I also recommend that people who already have ZFS on dedicated (and honest) disks check whether the disk caches are enabled.

Understanding latency

Brendan Gregg of Sun/Oracle fame gave a good explanation of how to visualize latency to get a better understanding of what is going on (and, as a consequence, of how to solve bottlenecks). I have seen all of this before in various posts on his blog and in the Analytics package in an OpenStorage presentation, but the ACM article summarizes it very well.

Unfortunately Analytics is, AFAIK, not available in OpenSolaris, so we cannot go out and adapt it for FreeBSD (which would probably require porting or implementing some additional DTrace probes). I am sure something like this would be very interesting to all those companies which use FreeBSD in an appliance (regardless of whether it is a storage appliance like NetApp, a network appliance like a Cisco/Juniper router, or anything else which has to perform well).
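Until something like Analytics exists for FreeBSD, a poor man's version of such a latency view can be produced with the io provider of DTrace (a sketch; it only prints a histogram instead of a heat map):

    # distribution of disk I/O completion times in nanoseconds
    dtrace -n '
        io:::start { ts[arg0] = timestamp; }
        io:::done /ts[arg0]/ {
            @["I/O latency (ns)"] = quantize(timestamp - ts[arg0]);
            ts[arg0] = 0;
        }'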

ARC (adaptive replacement cache) explained

At work we have the situation of a slow application. The vendor of the custom application insists that ZFS (Solaris 10u8) and the Oracle DB are badly tuned for the application. Part of their tuning is to limit the ARC to 1 GB (our max size is 24 GB on this machine). One problem we see is that there are many write operations (rounded values: 1k ops for up to 100 MB) and the DB is complaining that the log writer is not able to write out the data fast enough. At the same time our database admins see a lot of commits and/or rollbacks, so the archive log grows very fast to 1.5 GB. The funny thing is: the performance tests are supposed to cover only SELECTs and small UPDATEs.
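For reference, limiting the ARC on Solaris 10 is done in /etc/system and needs a reboot; this is a sketch of what the 1 GB limit the vendor asked for would look like:

    # /etc/system: cap the ARC at 1 GB (value in bytes), reboot required
    set zfs:zfs_arc_max = 0x40000000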

I proposed to reduce zfs_txg_timeout from the default value of 30 seconds to a few seconds (and since no reboot is needed, unlike for the max ARC size, this can be done quickly instead of waiting some minutes for the boot checks of the M5000). The first try was to reduce it to 5 seconds, and it improved the situation. The DB still complained about not being able to write out the logs fast enough, but it did not do so as often as before. To make the vendor happy we reduced the max ARC size and tested again. At first we did not see any complaints from the DB anymore, which looked strange to me, because my understanding of the ARC (and the description of the max-size setting in the ZFS Evil Tuning Guide) suggests that it should not show the behavior we saw; but the machine was also rebooted for this, so there could be another explanation.
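Unlike the ARC limit, zfs_txg_timeout can be changed on the running system with mdb, which is why no reboot was needed (a sketch; the usual warnings about writing to the live kernel with mdb -kw apply):

    # set the txg timeout to 5 seconds on the running kernel
    echo "zfs_txg_timeout/W0t5" | mdb -kw
    # verify the current value (printed as a decimal)
    echo "zfs_txg_timeout/D" | mdb -k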

Luckily we found out that our testing infrastructure had a problem, so only a fraction of the performance test was actually performed. This morning the people responsible for it made some changes, and now the DB is complaining again.

This is what I expected. To make sure I fully understand the ARC, I had a look at the theory behind it from the IBM research center (update: PDF link). There are some papers which explain how to extend a cache that uses the LRU replacement policy into an ARC with a few lines of code. It looks like it would be worthwhile to check at which places in FreeBSD an LRU policy is used and to test whether an ARC would improve the cache hit rate there. From reading the paper it looks like there are a lot of places where this should be the case. The authors also provide two adaptive extensions to the CLOCK algorithm (used in various OSes in the VM subsystem) which indicate that such an approach could be beneficial for a VM system. I have already contacted Alan (the FreeBSD one) and asked if he knows about it and whether it could be beneficial for FreeBSD.

Making ZFS faster…

Currently I am playing around a little bit with my ZFS setup. I want to make it faster, but I do not want to spend a lot of money.

The disks are connected to an ICH5 controller, so an obvious improvement would be either to buy a controller for the PCI slot which is able to do NCQ with the SATA disks (a siis(4) based one is not cheap), or to buy a new system which comes with a chipset that knows how to do NCQ (this would mean new RAM, new CPU, new mainboard and maybe even a new PSU). A new controller is a little bit expensive for the old system which I want to tune. A new system would be nice, and reading about the specs of new systems makes me want to get a Core i5 system. The problem is that I think the current offerings of mainboards for this are far from good. The system should be a little bit future-proof, as I would like to use it for about 5 years or more (the current system is somewhere between 5 and 6 years old). This means it should have SATA-3 and USB 3, but when I look at what is offered currently, it looks like there are only beta-quality versions of hardware with SATA-3 and USB 3 support available on the market (according to tests there is a lot of variance in the maximum speed the controllers are able to achieve, there are bugs in the BIOS, or the controllers are attached to a slow bus which prevents using the full bandwidth). So it will not be a new system soon.

As I had a 1 GB USB stick lying around, I decided to attach it to one of the EHCI USB ports and use it as a cache device for ZFS. If someone wants to try this too, be careful with the USB ports. My mainboard has only 2 USB ports connected to an EHCI controller; the rest are UHCI ones. This means that only 2 USB ports are fast (sort of… 40 MBit/s), the rest are only usable for slow things like a mouse, a keyboard or a serial line.
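Adding the stick as a cache device is a one-liner; the pool and device names below are only examples, not necessarily what my system really uses:

    # add the USB stick (showing up as da0 here) as an L2ARC cache device
    zpool add tank cache da0
    # and to get rid of it again later
    zpool remove tank da0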

Be warned, this will not give you a lot of bandwidth (if you have a fast USB stick, the 40 MBit/s of the EHCI are the limit which prevents a big streaming bandwidth), but the latency of the cache device is great when doing small random IO. When I run gstat and look at how long a read operation takes on each involved device, I see something between 3 msec and 20 msec for the hard disks (depending on whether they are reading at the current head position or need to seek around a lot). For the cache device (the USB stick) I see something between roughly 1 msec and 5 msec. That is about a third to a quarter of the latency of the hard disks.
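The latency numbers come from watching gstat; something like the following shows the per-device service times (the device filter is just an example, adjust it to your disk names):

    # refresh every second, only show the pool disks and the USB stick;
    # the ms/r column is the read latency I am talking about
    gstat -I 1s -f '^(ad[0-9]+|da0)$'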

With a “zfs send” I see about 300 IOPS per hard disk (3 disks in a RAIDZ). Obviously this is an optimal streaming case where the disks do not need to seek around a lot. You can see this in the low latency; it is about 2 msec in this case. In the random-read case, for example when you run a find, the disks cannot sustain this number of IOPS, as they need to seek around. And here the USB stick shines. I have seen up to 1600 IOPS on it while running a find (if the corresponding data is in the cache, of course). This was with something between 0.5 and 0.8 msec of latency.
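Another way to see how much of the random read load ends up on the stick is the per-vdev view of zpool iostat (the pool name is again just an example):

    # per-vdev statistics every second; the cache device gets its own section
    zpool iostat -v tank 1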

This is the machine at home which takes care of my mail (incoming and outgoing SMTP, IMAP and webmail), runs a Squid proxy and acts as a file server. There are not many users (just me and my wife) and there is no regular usage pattern for all those services. Because of this I did not run any benchmark to see how much time I can gain with various workloads (and I am not interested in some artificial performance numbers for my webmail session, as the browsing experience is highly subjective in this case). For this system a 1 GB USB stick (which was just collecting dust before) seems to be a cheap way to improve the response time for frequently used small data. When I use the webmail interface now, my subjective impression is that it is faster. I am talking about listing emails (subject, date, sender, size) and displaying the content of some emails. FYI, my maildir storage holds 849 MB in 35000 files across 91 folders.

The bottom line is: do not expect a big bandwidth increase from this, but if you have a workload which generates random read requests and you want to decrease the read latency, a (big) USB stick as a cache device could be a cheap solution.

Showing off some numbers…

At work we have some performance problems.

One application (not off-the-shelf software) is not performing well. The problem is that the design of the application is far from good (auto-commit is used, and because of this the Oracle DB is doing too many writes for what the application is supposed to do). While helping our DBAs with their performance analysis (the vendor of the application claims our hardware is not fast enough, and I had to provide some numbers to show that this is not the case and that they need to improve the software, as it does not comply with the performance requirements they got before developing the application), I noticed that the filesystem where the DB and the application are located (a ZFS, if someone is interested) sometimes does 1,200 (write) IO operations per second (to write about 100 MB). Yeah, that is a lot of IOPS our SAN is able to do! Unfortunately too expensive to buy for use at home. 🙁
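The IOPS numbers come from the usual Solaris tools, along the lines of the following (a sketch, nothing special):

    # per-device operation rates and service times, updated every second
    iostat -xn 1
    # or aggregated per pool
    zpool iostat 1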

Another application (Nagios 3.0) was generating a lot of major faults (caused by the many fork()s for the checks). It is a SunFire V890, and the highest number of major faults per second I have seen on this machine was about 27,000. It never went below 10,000. On average it was maybe somewhere between 15,000 and 20,000. My Solaris desktop (an Ultra 20) generates maybe several hundred major faults when a lot is going on (most of the time it does not generate much). Nobody can say the V890 is not used… 🙂 Oh, yes, I suggested enabling the Nagios config setting for large sites; now the major faults are around 0 to 10,000 and the machine is not as stressed anymore. The next step is probably to have a look at the ancient probes (migrated from the Big Brother setup which existed several years before) and reduce the number of forks they do.
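For the record, the "config setting for large sites" is the large-installation switch in nagios.cfg (the path to the file depends on the installation):

    # nagios.cfg: skip some expensive per-check housekeeping
    # (fewer forks and less environment setup per check)
    use_large_installation_tweaks=1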