How I set up a Jail-Host

Everyone has his own way of setting up a machine to serve as a host of multiple jails. Here is my way, YMMV.

Initial FreeBSD install

I use several hard disks in a software-RAID setup. It does not matter much whether you set them up with one big partition or with several partitions, feel free to follow your preferences here. My way of partitioning the hard disks is described in a previous post. That post only shows the commands to split the hard disks into two partitions and to use ZFS for the rootfs. The commands to initialize the ZFS data partition are not described, but you should be able to figure them out yourself (and you can decide on your own what kind of RAID level you want to use). For this FS I set atime, exec and setuid to off in the ZFS options.
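Purely as an illustration, creating a pool on top of such a data partition and setting those options could look roughly like this (the pool name "data", the mirror layout and the gpt labels are assumptions, not my exact commands):

zpool create data mirror gpt/data0 gpt/data1
zfs set atime=off data
zfs set exec=off data
zfs set setuid=off data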

On the ZFS data partition I create a new dataset for the system. For this dataset I set atime, exec and setuid to off in the ZFS options. Inside this dataset I create datasets for /home, /usr/compat, /usr/local, /usr/obj, /usr/ports/, /usr/src, /usr/sup and /var/ports. There are two ways of doing this. One way is to set the ZFS mountpoint. The way I prefer is to set relative symlinks to it, e.g. "cd /usr; ln -s ../data/system/usr_obj obj". I do this because this way I can temporarily import the pool on another machine (e.g. my desktop, if the need arises) without fear of interfering with the system. The ZFS options are set as follows:

ZFS options for data/system/*

Dataset                  Option          Value
data/system/home         exec            on
data/system/usr_compat   exec            on
data/system/usr_compat   setuid          on
data/system/usr_local    exec            on
data/system/usr_local    setuid          on
data/system/usr_obj      exec            on
data/system/usr_ports    exec            on
data/system/usr_ports    setuid          on
data/system/usr_src      exec            on
data/system/usr_sup      secondarycache  none
data/system/var_ports    exec            on

The exec option for home is not necessary if you keep separate datasets for each user. Normally I keep separate datasets for home directories, but Jail-Hosts should not have users (except the admins, but they should not keep data in their homes), so I just create a single home dataset. The setuid option for usr_ports should not be necessary if you redirect the build directory of the ports to a different place (WRKDIRPREFIX in /etc/make.conf).
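A rough sketch of how the system dataset, some of the datasets from the table and the relative symlinks could be created (an illustration only, extend it accordingly for the remaining datasets):

zfs create -o atime=off -o exec=off -o setuid=off data/system
zfs create -o exec=on data/system/home
zfs create -o exec=on -o setuid=on data/system/usr_local
zfs create -o exec=on data/system/usr_obj
zfs create -o secondarycache=none data/system/usr_sup
cd /usr
ln -s ../data/system/usr_local local
ln -s ../data/system/usr_obj obj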

Installing ports

The ports I install by default are net/rsync, ports-mgmt/portaudit, ports-mgmt/portmaster, shells/zsh, sysutils/bsdstats, sysutils/ezjail, sysutils/smartmontools and sysutils/tmux.
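One possible way to bootstrap this set on a fresh system is to build portmaster from the ports tree first and let it handle the rest; this is a sketch, not necessarily the way I do it:

cd /usr/ports/ports-mgmt/portmaster && make install clean
portmaster net/rsync ports-mgmt/portaudit shells/zsh sysutils/bsdstats \
    sysutils/ezjail sysutils/smartmontools sysutils/tmux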

Basic setup

In the crontab of root I set up a job to do a portsnap update once a day (I pick a random number between 0 and 59 for the minute, but keep a fixed hour). I also have http_proxy specified in /etc/profile, so that all machines in this network do not download everything from far away again and again, but can get the data from the local caching proxy. As a little watchdog I have an @reboot rule in the crontab which notifies me when a machine reboots:

@reboot grep "kernel boot file is" /var/log/messages | mail -s "`hostname` rebooted" root >/dev/null 2>&1

This does not replace a real monitoring solution, but in cases where real monitoring is overkill it provides a nice HEADS-UP (and shows you directly which kernel is loaded in case a non-default one is used).
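For completeness, the daily portsnap job mentioned above could look like this in the crontab of root (the minute and hour are just example values, pick your own):

23 3 * * * /usr/sbin/portsnap cron update >/dev/null 2>&1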

Some default aliases I use everywhere are:

alias portmlist="portmaster -L | egrep -B1 '(ew|ort) version|Aborting|installed|dependencies|IGNORE|marked|Reason:|MOVED|deleted|exist|update' | grep -v '^--'"
alias portmclean="portmaster -t --clean-distfiles --clean-packages"
alias portmcheck="portmaster -y --check-depends"

Additional devfs rules for Jails

I have the need to give access to some specific devices in some jails. For this I need to set up a custom /etc/devfs.rules file. The file contains some ID numbers which need to be unique in the system. On a 9-current system the numbers one to four are already used (see /etc/defaults/devfs.rules). The next available number is obviously five then. First I present my devfs.rules entries, then I explain them:

[devfsrules_unhide_audio=5]
add path 'audio*' unhide
add path 'dsp*' unhide
add path midistat unhide
add path 'mixer*' unhide
add path 'music*' unhide
add path 'sequencer*' unhide
add path sndstat unhide
add path speaker unhide

[devfsrules_unhide_printers=6]
add path 'lpt*' unhide
add path 'ulpt*' unhide user 193 group 193
add path 'unlpt*' unhide user 193 group 193

[devfsrules_unhide_zfs=7]
add path zfs unhide

[devfsrules_jail_printserver=8]
add include $devfsrules_hide_all
add include $devfsrules_unhide_basic
add include $devfsrules_unhide_login
add include $devfsrules_unhide_printers
add include $devfsrules_unhide_zfs

[devfsrules_jail_withzfs=9]
add include $devfsrules_hide_all
add include $devfsrules_unhide_basic
add include $devfsrules_unhide_login
add include $devfsrules_unhide_zfs

The devfsrules_unhide_XXX ones give access to specific devices, e.g. all the sound related devices or the local printers. The devfsrules_jail_XXX ones combine all the unhide rules for specific jail setups. Unfortunately the include directive is not recursive, so we can not include the default devfsrules_jail profile and need to replicate its contents. The first three includes of each devfsrules_jail_XXX accomplish this. The unhide_zfs rule gives access to /dev/zfs, which is needed if you attach one or more ZFS datasets to a jail. I will explain how to use those profiles with ezjail in a follow-up post.

Jails setup

I use ezjail to manage jails, it is more comfortable than doing everything by hand while at the same time it still allows me to do things by hand when needed. My jails normally reside inside ZFS datasets, for this reason I have set up a special area (ZFS dataset data/jails) which is handled by ezjail. The corresponding ezjail.conf settings are:

ezjail_jaildir=/data/jails
ezjail_use_zfs="YES"
ezjail_jailzfs="data/jails"

I also disabled procfs and fdescfs in jails (but they can be enabled later for specific jails if necessary).
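If I remember the variable names from ezjail.conf.sample correctly, this is a matter of settings along these lines (please verify against the sample file of your ezjail version):

ezjail_procfs_enable="NO"
ezjail_fdescfs_enable="NO"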

Unfortunately ezjail (as of v3.1) sets the mountpoint of a newly created dataset even if it is not necessary. For this reason I always issue a "zfs inherit mountpoint" on the dataset of the new jail after creating it. This simplifies the case where you want to move/rename a dataset and want to have the mountpoint automatically follow the change.
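For example, for a jail living in the hypothetical dataset data/jails/examplejail this would be:

zfs inherit mountpoint data/jails/examplejail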

The access flags of the /data/jails directory are 700, this prevents local users (there should be none, but better safe than sorry) from getting access to files of users in jails with the same UID.

After the first create/update of the ezjail basejail, the ZFS options of basejail (data/jails/basejail) and newjail (data/jails/newjail) need to be changed. For both, exec and setuid should be changed to "on". The same needs to be done after creating a new jail for the new jail (before starting it).
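In commands this amounts to something like the following (the dataset name in the last two lines is a hypothetical example for a newly created jail):

zfs set exec=on data/jails/basejail
zfs set setuid=on data/jails/basejail
zfs set exec=on data/jails/newjail
zfs set setuid=on data/jails/newjail
# and for each new jail, before starting it:
zfs set exec=on data/jails/examplejail
zfs set setuid=on data/jails/examplejail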

The default ezjail flavour

In my default ezjail flavour I create some default user(s) with a base-system shell (via /data/jails/flavours/mydef/ezjail.flavour) before the package install, and change the shell to my preferred zsh afterwards (this is only valid if the jails are used only by in-house people; if you want to offer lightweight virtual machines to (unknown) customers, the default user(s) and shell(s) are obviously up for discussion). At the end I also run a "/usr/local/sbin/portmaster -y --check-depends" to make sure everything is in a sane state.
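A stripped-down sketch of what such an ezjail.flavour script could contain; the user name "admin", the shell choices and the assumption that the flavour's pkg/ directory ends up as /pkg inside the jail are illustrations, not a copy of my file:

#!/bin/sh
# create a default user with a base-system shell before any packages exist
pw useradd admin -m -s /bin/csh -c "Jail admin"
# install the packages linked into the flavour's pkg/ directory
for P in /pkg/*; do
    pkg_add "$P"
done
# now that shells/zsh is installed, switch the shell of the user
pw usermod admin -s /usr/local/bin/zsh
# make sure the package registry is in a sane state
/usr/local/sbin/portmaster -y --check-depends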

For the packages (/data/jails/flavours/mydef/pkg/) I add symlinks to the unversioned packages I want to install. I have the packages in a common directory (think about setting PACKAGES in make.conf and using PACKAGES/Latest/XYZ.tbz) if they can be shared over various flavours, and they are unversioned so that I do not have to update the version number each time there is an update. The packages I install by default are bsdstats, portaudit, portmaster, zsh, tmux and all their dependencies.
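As an illustration, with a hypothetical common package directory /data/packages the symlinks could be created like this:

cd /data/jails/flavours/mydef/pkg
ln -s /data/packages/Latest/zsh.tbz
ln -s /data/packages/Latest/tmux.tbz
ln -s /data/packages/Latest/portmaster.tbz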

In case you use jails to virtualize services and consolidate servers (e.g. DNS, HTTP, MySQL each in a separate jail) instead of providing lightweight virtual machines to (unknown) customers, there is also a benefit in sharing the distfiles and packages between jails on the same machine. To do this I create /data/jails/flavours/mydef/shared/ports/{distfiles,packages}, which are then mounted via nullfs or NFS into all the jails from a common directory. This requires the following variables in /data/jails/flavours/mydef/etc/make.conf (I also keep the packages for different CPU types and compilers in the same subtree, if you do not care, just remove the "/${CC}/${CPUTYPE}" from the PACKAGES line):

DISTDIR=  /shared/ports/distfiles
PACKAGES= /shared/ports/packages/${CC}/${CPUTYPE}
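The nullfs mount for such a jail can then be declared in the per-jail fstab; a sketch with a hypothetical jail name and host-side source directory:

# /etc/fstab.examplejail
/data/portstuff  /data/jails/examplejail/shared/ports  nullfs  rw  0  0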

New jails

A future post will cover how I set up new jails in such a setup and how I customize the start order of jails or use some non-default settings for the jail-startup.

Solaris UFS full while df shows plenty of free space/inodes

At work we have a Solaris 8 with a UFS which told the application that it can not create new files. The df command showed plenty of free inodes, and there was also enough space free in the FS. The reason that the application got the error was that while there were still plenty of fragments free, no free block was available anymore. You can not create a new file only with fragments, you need to have at least one free block for each new file.

To see the number of free blocks of a UFS you can run "fstyp -v" on the raw device, pipe the output through "head -18" and look at the value behind "nbfree".
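A concrete invocation could look like this (the device path is a made-up example):

fstyp -v /dev/rdsk/c0t0d0s6 | head -18
fstyp -v /dev/rdsk/c0t0d0s6 | grep nbfree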

To get this working again we cleaned up the FS a little bit (compressing/deleting log files), but this is only a temporary solution. Unluckily we can not move this application to a Solaris 10 with ZFS, so I was playing around a little bit to see what we can do.

First I made a histogram of the file sizes. The backup of the FS I was playing with had a little bit more than 4 million files in this FS. 28.5% of them were smaller than or equal to 512 bytes, 31.7% were smaller than or equal to 1k (fragment size), 36% smaller than or equal to 8k (block size) and 74% smaller than or equal to 16k. The following graph shows in red the critical part: files which need a block and produce fragments, but can not live with only fragments.

[Chart: histogram of the file sizes; the critical range is marked in red.]
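Such a size histogram can be gathered with standard tools; a rough sketch (not my exact method; the path is made up, and on Solaris 8 you may need nawk instead of awk):

find /backup/fs -type f -exec ls -l {} \; | awk '
    { n++
      if ($5 <= 512)   a++
      if ($5 <= 1024)  b++
      if ($5 <= 8192)  c++
      if ($5 <= 16384) d++ }
    END { print "<=512b:", a/n*100 "%", "<=1k:", b/n*100 "%", "<=8k:", c/n*100 "%", "<=16k:", d/n*100 "%" }'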

Then I played around with newfs options for this one specific FS with this specific data mix. Changing the number of inodes did not change the outcome for our problem much (as expected). Changing the optimization from "time" to "space" (and restoring all the data from backup into the empty FS) gave us 1000 more free blocks. On a FS which had 10 million free blocks when empty this is not much, but we expect that the restore consumes fewer fragments and more full blocks than the live-FS of the application (we can not compare, as the content of the live-FS changed a lot since we had the problem). We assume that e.g. the logs of the application are split over a lot of fragments instead of full blocks, due to small writes to the logs by the application. The restore should write all the data in big chunks, so our expectation is that the FS will use more full blocks and fewer fragments. Because of this we expect that the live-FS with this specific data mix could benefit from changing the optimization.

I also played around with the fragment size. The expectation was that it would only change what is reported in the output of df (reducing the reported available space for the same amount of data). Here is the result:

[Chart: reported free space for different fragment sizes.]

The difference between 1k (default) and 2k is not much. With 8k we would lose too much unused space. A fragment size of 4k looks acceptable to get a better monitoring status for this particular data mix.

Based upon this we will probably create a new FS with a fragment size of 4k and we will probably switch the optimization directly to "space". This way we will have a better reporting on the fill level of the FS for our data mix (but we will not be able to fully use the real space of the FS) and as such our monitoring should alert us in time to do a cleanup of the FS or to increase the size of the FS.
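On Solaris the corresponding newfs invocation would be roughly as follows (the device path is made up; double-check the options against newfs(1M) before using this on real data):

newfs -f 4096 -o space /dev/rdsk/c0t0d0s6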

ZFS and NFS / on-disk-cache

In the FreeBSD mailinglists I stumbled over a post which refers to a blog-post which describes why ZFS seems to be slow (on Solaris).

In short: ZFS guarantees that the NFS client does not experience silent corruption of data (NFS server crash and loss of data which is supposed to be already on disk for the client). A recommendation is to enable the disk-cache for disks which are completely used by ZFS, as ZFS (unlike UFS) is aware of disk-caches. This increases the performance to what UFS is delivering in the NFS case.

There is no in-depth description of what it means that ZFS is aware of disk-caches, but I think this is a reference to the fact that ZFS sends a flush command to the disk at the right moments. Leaving aside the fact that there are disks out there which lie to you about this (they report the flush command as finished when it is not), this would mean that this is supported in FreeBSD too.

So everyone who is currently disabling the ZIL to get better NFS performance (and accepting silent data corruption on the client side): move your zpool to dedicated (no other real FS than ZFS, swap and dump devices are OK) disks (honest ones) and enable the disk-caches instead of disabling the ZIL.

I also recommend that people who already have ZFS on dedicated (and honest) disks have a look at whether the disk-caches are enabled.
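On FreeBSD, how to check this depends on the driver in use; a sketch (the device name is an example, consult the respective man pages):

sysctl hw.ata.wc                 # ata(4) disks: 1 means the on-disk write cache is enabled
camcontrol modepage da0 -m 8     # CAM/SCSI disks: look at the WCE field (1 = enabled)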

Showing off some numbers…

At work we have some performance problems.

One application (not off-the-shelf software) is not performing well. The problem is that the design of the application is far from good (auto-commit is used, and because of this the Oracle DB is doing too many writes for what the application is supposed to do). While helping our DBAs in their performance analysis (the vendor of the application is telling us our hardware is not fast enough, and I had to provide some numbers to show that this is not the case and that they need to improve the software, as it does not comply with the performance requirements they got before developing the application) I noticed that the filesystem where the DB and the application are located (a ZFS, if someone is interested) is sometimes doing 1,200 write operations per second (to write about 100 MB). Yeah, that is a lot of IOPS our SAN is able to do! Unfortunately too expensive to buy for use at home. 🙁

Another application (nagios 3.0) was generating a lot of major faults (caused by a lot of fork()s for the checks). It is a SunFire V890, and the highest number of MF per second I have seen on this machine was about 27,000. It never went below 10,000. On average it was maybe somewhere between 15,000 and 20,000. My Solaris-Desktop (an Ultra 20) generates maybe several hundred MF if a lot is going on (most of the time it does not generate much). Nobody can say the V890 is not used… 🙂 Oh, yes, I suggested enabling the nagios config setting for large sites, now the major faults are around 0 to 10,000 and the machine is not that stressed anymore. The next step is probably to have a look at the ancient probes (migrated from the big brother setup which was there several years before) and reduce the number of forks they do.

Some fixes for ZFS on 7-stable (more testers wanted)

Due to the problems with a 7-stable machine, I had a look at some unmerged fixes for ZFS (58 changes not merged).

I backported some of those changes from 8-stable to 7-stable, and I have this running on one 7-stable machine. I would like to get some more feedback for it (even an "it works for me" would be great). The main part of this change is that the FreeBSD taskqueue is now used instead of the opensolaris one (plus some other changes which may improve the ZFS experience).

It would also be nice if someone could have a look at the FIRST_THREAD_IN_PROC part. Can there be more than one thread at this place (I do not think so), and should I use FOREACH_THREAD_IN_PROC instead?

How to apply:

  • cd /usr/src/
  • fetch http://www.Leidinger.net/FreeBSD/test/releng7_zfs_merge3.diff
  • fetch http://www.Leidinger.net/FreeBSD/test/opensolaris_taskq.c
  • fetch http://www.Leidinger.net/FreeBSD/test/taskq.h
  • mv taskq.h sys/cddl/contrib/opensolaris/uts/common/sys/taskq.h
  • mv opensolaris_taskq.c sys/cddl/compat/opensolaris/kern/opensolaris_taskq.c
  • patch -p 0 --quiet < releng7_zfs_merge3.diff
  • ignore the 2 .rej files
  • rm -f sys/cddl/compat/opensolaris/sys/taskq_impl.h*
  • rm -f sys/cddl/compat/opensolaris/sys/taskq.h*
  • rm -f sys/cddl/contrib/opensolaris/uts/common/os/taskq.c*
  • rebuild the kernel

I do not list here all 16 of the 58 outstanding patches which are covered by this diff; a detailed list can be found on the stable and fs mailinglists.