ZFS and NFS / on-disk-cache

In the FreeBSD mailing lists I stumbled over a post which refers to a blog post describing why ZFS seems to be slow (on Solaris).

In short: ZFS guarantees that the NFS client does not experience silent data corruption (an NFS server crash losing data which the client believes is already on disk). A recommendation is to enable the disk write cache for disks which are used exclusively by ZFS, as ZFS (unlike UFS) is aware of disk caches. This brings the performance up to what UFS delivers in the NFS case.

There is no in-depth description of what it means that ZFS is aware of disk caches, but I think this refers to the fact that ZFS sends a flush command to the disk at the right moments. Leaving aside the fact that there are disks out there which lie to you about this (they report the flush command as finished when it is not), this would mean that this is supported in FreeBSD too.
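For illustration, checking and enabling the disk write cache on FreeBSD looks roughly like this (a sketch; the device name da0 is just an example, and the tunable/commands are the ones I remember, so double-check against your FreeBSD version):

    # ata(4) disks: the write cache is controlled via a loader tunable
    # (put into /boot/loader.conf, takes effect on the next boot)
    hw.ata.wc=1

    # CAM-attached (SCSI/SAS) disks: inspect the caching mode page (8),
    # the WCE field is the write-cache-enable bit; -e opens it for editing
    camcontrol modepage da0 -m 8
    camcontrol modepage da0 -m 8 -e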

So everyone who is currently disabling the ZIL to get better NFS performance (and accepting silent data corruption on the client side): move your zpool to dedicated, honest disks (no other real filesystem than ZFS; swap and dump devices are OK) and enable the disk caches instead of disabling the ZIL.
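To be explicit about what I mean (a sketch from memory; the exact tunable name may differ between FreeBSD versions): the ZIL is typically switched off via a loader tunable, and the point here is to not do that.

    # /boot/loader.conf
    # what people do to get NFS performance back (do NOT do this, it trades
    # away the correctness guarantee for the NFS client):
    #vfs.zfs.zil_disable=1
    # instead: keep the ZIL enabled and switch on the write cache of the
    # disks which are used exclusively by ZFS (see the camcontrol example above)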

I also recommend that people who already have ZFS on dedicated (and honest) disks check whether the disk caches are enabled.

Understanding latency

Brendan Gregg of Sun/Oracle fame gave a good explanation of how to visualize latency to get a better understanding of what is going on (and, as a consequence, of how to solve bottlenecks). I have seen all of this before in various posts on his blog and in the Analytics package in an OpenStorage presentation, but the ACM article summarizes it very well.

Unfortunately Analytics is, AFAIK, not available in OpenSolaris, so we cannot go out and adapt it for FreeBSD (which would probably require porting or implementing some additional DTrace probes). I am sure something like this would be very interesting to all those companies which use FreeBSD in an appliance (regardless of whether it is a storage appliance like NetApp, a network appliance like a Cisco/Juniper router, or anything else which has to perform well).
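Until something like Analytics exists for FreeBSD, a poor man's version of such a latency view can be produced with the io provider of DTrace (a sketch; it only prints a histogram instead of a heat map):

    # distribution of disk I/O completion times in nanoseconds
    dtrace -n '
        io:::start { ts[arg0] = timestamp; }
        io:::done /ts[arg0]/ {
            @["I/O latency (ns)"] = quantize(timestamp - ts[arg0]);
            ts[arg0] = 0;
        }'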

ARC (adaptive replacement cache) explained

At work we have the situation of a slow application. The vendor of the custom application insists that ZFS (Solaris 10u8) and the Oracle DB are badly tuned for the application. Part of their tuning is to limit the ARC to 1 GB (our max size is 24 GB on this machine). One problem we see is that there are many write operations (rounded values: 1k ops for up to 100 MB) and the DB is complaining that the log writer is not able to write out the data fast enough. At the same time our database admins see a lot of commits and/or rollbacks, so the archive log grows very fast to 1.5 GB. The funny thing is: the performance tests are supposed to cover only SELECTs and small UPDATEs.
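For reference, limiting the ARC on Solaris 10 is done in /etc/system and needs a reboot; this is a sketch of what the 1 GB limit the vendor asked for would look like:

    # /etc/system: cap the ARC at 1 GB (value in bytes), reboot required
    set zfs:zfs_arc_max = 0x40000000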

I proposed to reduce zfs_txg_timeout from the default value of 30 seconds to a few seconds (and since no reboot is needed, unlike for the max ARC size, this can be done quickly instead of waiting some minutes for the boot checks of the M5000). The first try was to reduce it to 5 seconds, and it improved the situation. The DB still complained about not being able to write out the logs fast enough, but it did not do so as often as before. To make the vendor happy we reduced the max ARC size and tested again. At first we did not see any complaints from the DB anymore, which looked strange to me, because my understanding of the ARC (and the description of the max-size setting in the ZFS Evil Tuning Guide) suggests that it should not show the behavior we saw; but the machine was also rebooted for this, so there could be another explanation.
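Unlike the ARC limit, zfs_txg_timeout can be changed on the running system with mdb, which is why no reboot was needed (a sketch; the usual warnings about writing to the live kernel with mdb -kw apply):

    # set the txg timeout to 5 seconds on the running kernel
    echo "zfs_txg_timeout/W0t5" | mdb -kw
    # verify the current value (printed as a decimal)
    echo "zfs_txg_timeout/D" | mdb -k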

Luckily we found out that our testing infrastructure had a problem, so only a fraction of the performance test was actually performed. This morning the people responsible for it made some changes, and now the DB is complaining again.

This is what I expected. To make sure I fully understand the ARC, I had a look at the theory behind it from the IBM research center (update: PDF link). There are some papers which explain how to extend a cache that uses the LRU replacement policy into an ARC with a few lines of code. It looks like it would be worthwhile to check at which places in FreeBSD an LRU policy is used and to test whether an ARC would improve the cache hit rate there. From reading the paper it looks like there are a lot of places where this should be the case. The authors also provide two adaptive extensions to the CLOCK algorithm (used in various OSes in the VM subsystem) which indicate that such an approach could be beneficial for a VM system. I have already contacted Alan (the FreeBSD one) and asked if he knows about it and whether it could be beneficial for FreeBSD.

Making ZFS faster…

Currently I am playing around a little bit with my ZFS setup. I want to make it faster, but I do not want to spend a lot of money.

The disks are connected to an ICH5 controller, so an obvious improvement would be either to buy a controller for the PCI slot which is able to do NCQ with the SATA disks (a siis(4) based one is not cheap), or to buy a new system which comes with a chipset that knows how to do NCQ (this would mean new RAM, new CPU, new mainboard and maybe even a new PSU). A new controller is a little bit expensive for the old system which I want to tune. A new system would be nice, and reading about the specs of new systems makes me want to get a Core i5 system. The problem is that I think the current offerings of mainboards for this are far from good. The system should be a little bit future-proof, as I would like to use it for about 5 years or more (the current system is somewhere between 5 and 6 years old). This means it should have SATA-3 and USB 3, but when I look at what is offered currently, it looks like there are only beta-quality versions of hardware with SATA-3 and USB 3 support available on the market (according to tests there is a lot of variance in the maximum speed the controllers are able to achieve, there are bugs in the BIOS, or the controllers are attached to a slow bus which prevents using the full bandwidth). So it will not be a new system soon.

As I had a 1 GB USB stick lying around, I decided to attach it to one of the EHCI USB ports and use it as a cache device for ZFS. If someone wants to try this too, be careful with the USB ports. My mainboard has only 2 USB ports connected to an EHCI controller; the rest are UHCI ones. This means that only 2 USB ports are fast (sort of… 40 MBit/s), the rest are only usable for slow things like a mouse, a keyboard or a serial line.
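Adding the stick as a cache device is a one-liner; the pool and device names below are only examples, not necessarily what my system really uses:

    # add the USB stick (showing up as da0 here) as an L2ARC cache device
    zpool add tank cache da0
    # and to get rid of it again later
    zpool remove tank da0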

Be warned, this will not give you a lot of bandwidth (if you have a fast USB stick, the 40 MBit/s of the EHCI are the limit which prevents a big streaming bandwidth), but the latency of the cache device is great when doing small random IO. When I run gstat and look at how long a read operation takes on each involved device, I see something between 3 msec and 20 msec for the hard disks (depending on whether they are reading at the current head position or need to seek around a lot). For the cache device (the USB stick) I see something between roughly 1 msec and 5 msec. That is about a third to a quarter of the latency of the hard disks.
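The latency numbers come from watching gstat; something like the following shows the per-device service times (the device filter is just an example, adjust it to your disk names):

    # refresh every second, only show the pool disks and the USB stick;
    # the ms/r column is the read latency I am talking about
    gstat -I 1s -f '^(ad[0-9]+|da0)$'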

With a “zfs send” I see about 300 IOPS per hard disk (3 disks in a RAIDZ). Obviously this is an optimal streaming case where the disks do not need to seek around a lot. You can see this in the low latency; it is about 2 msec in this case. In the random-read case, for example when you run a find, the disks cannot sustain this number of IOPS, as they need to seek around. And here the USB stick shines. I have seen up to 1600 IOPS on it while running a find (if the corresponding data is in the cache, of course). This was with something between 0.5 and 0.8 msec of latency.
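Another way to see how much of the random read load ends up on the stick is the per-vdev view of zpool iostat (the pool name is again just an example):

    # per-vdev statistics every second; the cache device gets its own section
    zpool iostat -v tank 1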

This is the machine at home which takes care of my mail (incoming and outgoing SMTP, IMAP and webmail), runs a Squid proxy and acts as a file server. There are not many users (just me and my wife) and there is no regular usage pattern for all those services. Because of this I did not run any benchmark to see how much time I can gain with various workloads (and I am not interested in some artificial performance numbers for my webmail session, as the browsing experience is highly subjective in this case). For this system a 1 GB USB stick (which was just collecting dust before) seems to be a cheap way to improve the response time for frequently used small data. When I use the webmail interface now, my subjective impression is that it is faster. I am talking about listing emails (subject, date, sender, size) and displaying the content of some emails. FYI, my maildir storage holds 849 MB in 35000 files across 91 folders.

The bottom line is: do not expect a big bandwidth increase from this, but if you have a workload which generates random read requests and you want to decrease the read latency, a (big) USB stick as a cache device could be a cheap solution.

Showing off some numbers…

At work we have some performance problems.

One application (not off-the-shelf software) is not performing well. The problem is that the design of the application is far from good (auto-commit is used, and because of this the Oracle DB is doing too many writes for what the application is supposed to do). While helping our DBAs with their performance analysis (the vendor of the application claims our hardware is not fast enough, and I had to provide some numbers to show that this is not the case and that they need to improve the software, as it does not comply with the performance requirements they got before developing the application), I noticed that the filesystem where the DB and the application are located (a ZFS, if someone is interested) sometimes does 1,200 (write) IO operations per second (to write about 100 MB). Yeah, that is a lot of IOPS our SAN is able to do! Unfortunately too expensive to buy for use at home. 🙁
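The IOPS numbers come from the usual Solaris tools, along the lines of the following (a sketch, nothing special):

    # per-device operation rates and service times, updated every second
    iostat -xn 1
    # or aggregated per pool
    zpool iostat 1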

Another application (Nagios 3.0) was generating a lot of major faults (caused by the many fork()s for the checks). It is a SunFire V890, and the highest number of major faults per second I have seen on this machine was about 27,000. It never went below 10,000. On average it was maybe somewhere between 15,000 and 20,000. My Solaris desktop (an Ultra 20) generates maybe several hundred major faults when a lot is going on (most of the time it does not generate much). Nobody can say the V890 is not used… 🙂 Oh, yes, I suggested enabling the Nagios config setting for large sites; now the major faults are around 0 to 10,000 and the machine is not as stressed anymore. The next step is probably to have a look at the ancient probes (migrated from the Big Brother setup which existed several years before) and reduce the number of forks they do.
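For the record, the "config setting for large sites" is the large-installation switch in nagios.cfg (the path to the file depends on the installation):

    # nagios.cfg: skip some expensive per-check housekeeping
    # (fewer forks and less environment setup per check)
    use_large_installation_tweaks=1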