How I set up a Jail-Host

Everyone has his own way of setting up a machine to serve as a host of multiple jails. Here is my way, YMMV.

Initial FreeBSD install

I use several hard disks in a software-RAID setup. It does not matter much whether you set them up with one big partition or with several partitions, feel free to follow your preferences here. My way of partitioning the hard disks is described in a previous post. That post only shows the commands to split the hard disks into two partitions and to use ZFS for the rootfs. The commands to initialize the ZFS data partition are not described, but you should be able to figure them out yourself (and you can decide on your own what kind of RAID level you want to use). For this FS I set atime, exec and setuid to off in the ZFS options.
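Purely as an illustration, creating a pool on top of such a data partition and setting those options could look roughly like this (the pool name "data", the mirror layout and the gpt labels are assumptions, not my exact commands):

zpool create data mirror gpt/data0 gpt/data1
zfs set atime=off data
zfs set exec=off data
zfs set setuid=off data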

On the ZFS data partition I create a new dataset for the system. For this dataset I set atime, exec and setuid to off in the ZFS options. Inside this dataset I create datasets for /home, /usr/compat, /usr/local, /usr/obj, /usr/ports/, /usr/src, /usr/sup and /var/ports. There are two ways of doing this. One way is to set the ZFS mountpoint. The way I prefer is to set relative symlinks to it, e.g. "cd /usr; ln -s ../data/system/usr_obj obj". I do this because this way I can temporarily import the pool on another machine (e.g. my desktop, if the need arises) without fear of interfering with the system. The ZFS options are set as follows:

ZFS options for data/system/*

Dataset                  Option          Value
data/system/home         exec            on
data/system/usr_compat   exec            on
data/system/usr_compat   setuid          on
data/system/usr_local    exec            on
data/system/usr_local    setuid          on
data/system/usr_obj      exec            on
data/system/usr_ports    exec            on
data/system/usr_ports    setuid          on
data/system/usr_src      exec            on
data/system/usr_sup      secondarycache  none
data/system/var_ports    exec            on

The exec option for home is not necessary if you keep separate datasets for each user. Normally I keep separate datasets for home directories, but Jail-Hosts should not have users (except the admins, but they should not keep data in their homes), so I just create a single home dataset. The setuid option for usr_ports should not be necessary if you redirect the build directory of the ports to a different place (WRKDIRPREFIX in /etc/make.conf).
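A rough sketch of how the system dataset, some of the datasets from the table and the relative symlinks could be created (an illustration only, extend it accordingly for the remaining datasets):

zfs create -o atime=off -o exec=off -o setuid=off data/system
zfs create -o exec=on data/system/home
zfs create -o exec=on -o setuid=on data/system/usr_local
zfs create -o exec=on data/system/usr_obj
zfs create -o secondarycache=none data/system/usr_sup
cd /usr
ln -s ../data/system/usr_local local
ln -s ../data/system/usr_obj obj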

Installing ports

The ports I install by default are net/rsync, ports-mgmt/portaudit, ports-mgmt/portmaster, shells/zsh, sysutils/bsdstats, sysutils/ezjail, sysutils/smartmontools and sysutils/tmux.
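One possible way to bootstrap this set on a fresh system is to build portmaster from the ports tree first and let it handle the rest; this is a sketch, not necessarily the way I do it:

cd /usr/ports/ports-mgmt/portmaster && make install clean
portmaster net/rsync ports-mgmt/portaudit shells/zsh sysutils/bsdstats \
    sysutils/ezjail sysutils/smartmontools sysutils/tmux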

Basic setup

In the crontab of root I set up a job to do a portsnap update once a day (I pick a random number between 0 and 59 for the minute, but keep a fixed hour). I also have http_proxy specified in /etc/profile, so that all machines in this network do not download everything from far away again and again, but can get the data from the local caching proxy. As a little watchdog I have an @reboot rule in the crontab which notifies me when a machine reboots:

@reboot grep "kernel boot file is" /var/log/messages | mail -s "`hostname` rebooted" root >/dev/null 2>&1

This does not replace a real monitoring solution, but in cases where real monitoring is overkill it provides a nice HEADS-UP (and shows you directly which kernel is loaded in case a non-default one is used).
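For completeness, the daily portsnap job mentioned above could look like this in the crontab of root (the minute and hour are just example values, pick your own):

23 3 * * * /usr/sbin/portsnap cron update >/dev/null 2>&1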

Some default aliases I use everywhere are:

alias portmlist="portmaster -L | egrep -B1 '(ew|ort) version|Aborting|installed|dependencies|IGNORE|marked|Reason:|MOVED|deleted|exist|update' | grep -v '^--'"
alias portmclean="portmaster -t --clean-distfiles --clean-packages"
alias portmcheck="portmaster -y --check-depends"

Additional devfs rules for Jails

I have the need to give access to some specific devices in some jails. For this I need to set up a custom /etc/devfs.rules file. The file contains some ID numbers which need to be unique in the system. On a 9-current system the numbers one to four are already used (see /etc/defaults/devfs.rules). The next available number is obviously five then. First I present my devfs.rules entries, then I explain them:

[devfsrules_unhide_audio=5]
add path 'audio*' unhide
add path 'dsp*' unhide
add path midistat unhide
add path 'mixer*' unhide
add path 'music*' unhide
add path 'sequencer*' unhide
add path sndstat unhide
add path speaker unhide

[devfsrules_unhide_printers=6]
add path 'lpt*' unhide
add path 'ulpt*' unhide user 193 group 193
add path 'unlpt*' unhide user 193 group 193

[devfsrules_unhide_zfs=7]
add path zfs unhide

[devfsrules_jail_printserver=8]
add include $devfsrules_hide_all
add include $devfsrules_unhide_basic
add include $devfsrules_unhide_login
add include $devfsrules_unhide_printers
add include $devfsrules_unhide_zfs

[devfsrules_jail_withzfs=9]
add include $devfsrules_hide_all
add include $devfsrules_unhide_basic
add include $devfsrules_unhide_login
add include $devfsrules_unhide_zfs

The devfsrules_unhide_XXX ones give access to specific devices, e.g. all the sound related devices or the local printers. The devfsrules_jail_XXX ones combine all the unhide rules for specific jail setups. Unfortunately the include directive is not recursive, so we can not include the default devfsrules_jail profile and need to replicate its contents. The first three includes of each devfsrules_jail_XXX accomplish this. The unhide_zfs rule gives access to /dev/zfs, which is needed if you attach one or more ZFS datasets to a jail. I will explain how to use those profiles with ezjail in a follow-up post.

Jails setup

I use ezjail to manage jails, it is more comfortable than doing everything by hand while at the same time it still allows me to do things by hand when needed. My jails normally reside inside ZFS datasets, for this reason I have set up a special area (ZFS dataset data/jails) which is handled by ezjail. The corresponding ezjail.conf settings are:

ezjail_jaildir=/data/jails
ezjail_use_zfs="YES"
ezjail_jailzfs="data/jails"

I also disabled procfs and fdescfs in jails (but they can be enabled later for specific jails if necessary).
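If I remember the variable names from ezjail.conf.sample correctly, this is a matter of settings along these lines (please verify against the sample file of your ezjail version):

ezjail_procfs_enable="NO"
ezjail_fdescfs_enable="NO"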

Unfortunately ezjail (as of v3.1) sets the mountpoint of a newly created dataset even if it is not necessary. For this reason I always issue a "zfs inherit mountpoint" on the dataset of the new jail after creating it. This simplifies the case where you want to move/rename a dataset and want to have the mountpoint automatically follow the change.
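For example, for a jail living in the hypothetical dataset data/jails/examplejail this would be:

zfs inherit mountpoint data/jails/examplejail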

The access flags of the /data/jails directory are 700, this prevents local users (there should be none, but better safe than sorry) from getting access to files of users in jails with the same UID.

After the first create/update of the ezjail basejail, the ZFS options of basejail (data/jails/basejail) and newjail (data/jails/newjail) need to be changed. For both, exec and setuid should be changed to "on". The same needs to be done after creating a new jail for the new jail (before starting it).
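In commands this amounts to something like the following (the dataset name in the last two lines is a hypothetical example for a newly created jail):

zfs set exec=on data/jails/basejail
zfs set setuid=on data/jails/basejail
zfs set exec=on data/jails/newjail
zfs set setuid=on data/jails/newjail
# and for each new jail, before starting it:
zfs set exec=on data/jails/examplejail
zfs set setuid=on data/jails/examplejail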

The default ezjail flavour

In my default ezjail flavour I create some default user(s) with a base-system shell (via /data/jails/flavours/mydef/ezjail.flavour) before the package install, and change the shell to my preferred zsh afterwards (this is only valid if the jails are used only by in-house people; if you want to offer lightweight virtual machines to (unknown) customers, the default user(s) and shell(s) are obviously up for discussion). At the end I also run a "/usr/local/sbin/portmaster -y --check-depends" to make sure everything is in a sane state.
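A stripped-down sketch of what such an ezjail.flavour script could contain; the user name "admin", the shell choices and the assumption that the flavour's pkg/ directory ends up as /pkg inside the jail are illustrations, not a copy of my file:

#!/bin/sh
# create a default user with a base-system shell before any packages exist
pw useradd admin -m -s /bin/csh -c "Jail admin"
# install the packages linked into the flavour's pkg/ directory
for P in /pkg/*; do
    pkg_add "$P"
done
# now that shells/zsh is installed, switch the shell of the user
pw usermod admin -s /usr/local/bin/zsh
# make sure the package registry is in a sane state
/usr/local/sbin/portmaster -y --check-depends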

For the packages (/data/jails/flavours/mydef/pkg/) I add symlinks to the unversioned packages I want to install. I have the packages in a common directory (think about setting PACKAGES in make.conf and using PACKAGES/Latest/XYZ.tbz) if they can be shared over various flavours, and they are unversioned so that I do not have to update the version number each time there is an update. The packages I install by default are bsdstats, portaudit, portmaster, zsh, tmux and all their dependencies.
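As an illustration, with a hypothetical common package directory /data/packages the symlinks could be created like this:

cd /data/jails/flavours/mydef/pkg
ln -s /data/packages/Latest/zsh.tbz
ln -s /data/packages/Latest/tmux.tbz
ln -s /data/packages/Latest/portmaster.tbz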

In case you use jails to virtualize services and consolidate servers (e.g. DNS, HTTP, MySQL each in a separate jail) instead of providing lightweight virtual machines to (unknown) customers, there is also a benefit in sharing the distfiles and packages between jails on the same machine. To do this I create /data/jails/flavours/mydef/shared/ports/{distfiles,packages}, which are then mounted via nullfs or NFS into all the jails from a common directory. This requires the following variables in /data/jails/flavours/mydef/etc/make.conf (I also keep the packages for different CPU types and compilers in the same subtree, if you do not care, just remove the "/${CC}/${CPUTYPE}" from the PACKAGES line):

DISTDIR=  /shared/ports/distfiles
PACKAGES= /shared/ports/packages/${CC}/${CPUTYPE}
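The nullfs mount for such a jail can then be declared in the per-jail fstab; a sketch with a hypothetical jail name and host-side source directory:

# /etc/fstab.examplejail
/data/portstuff  /data/jails/examplejail/shared/ports  nullfs  rw  0  0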

New jails

A future post will cover how I set up new jails in such a setup and how I customize the start order of jails or use some non-default settings for the jail-startup.

Solaris UFS full while df shows plenty of free space/inodes

At work we have a Solaris 8 with a UFS which told the application that it can not create new files. The df command showed plenty of free inodes, and there was also enough space free in the FS. The reason that the application got the error was that while there were still plenty of fragments free, no free block was available anymore. You can not create a new file only with fragments, you need to have at least one free block for each new file.

To see the number of free blocks of a UFS you can run "fstyp -v" on the raw device, pipe the output through "head -18" and look at the value behind "nbfree".
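A concrete invocation could look like this (the device path is a made-up example):

fstyp -v /dev/rdsk/c0t0d0s6 | head -18
fstyp -v /dev/rdsk/c0t0d0s6 | grep nbfree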

To get this working again we cleaned up the FS a little bit (compressing/deleting log files), but this is only a temporary solution. Unluckily we can not move this application to a Solaris 10 with ZFS, so I was playing around a little bit to see what we can do.

First I made a histogram of the file sizes. The backup of the FS I was playing with had a little bit more than 4 million files in this FS. 28.5% of them were smaller than or equal to 512 bytes, 31.7% were smaller than or equal to 1k (fragment size), 36% smaller than or equal to 8k (block size) and 74% smaller than or equal to 16k. The following graph shows in red the critical part: files which need a block and produce fragments, but can not live with only fragments.

[Chart: histogram of the file sizes; the critical range is marked in red.]
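Such a size histogram can be gathered with standard tools; a rough sketch (not my exact method; the path is made up, and on Solaris 8 you may need nawk instead of awk):

find /backup/fs -type f -exec ls -l {} \; | awk '
    { n++
      if ($5 <= 512)   a++
      if ($5 <= 1024)  b++
      if ($5 <= 8192)  c++
      if ($5 <= 16384) d++ }
    END { print "<=512b:", a/n*100 "%", "<=1k:", b/n*100 "%", "<=8k:", c/n*100 "%", "<=16k:", d/n*100 "%" }'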

Then I played around with newfs options for this one specific FS with this specific data mix. Changing the number of inodes did not change the outcome for our problem much (as expected). Changing the optimization from "time" to "space" (and restoring all the data from backup into the empty FS) gave us 1000 more free blocks. On a FS which had 10 million free blocks when empty this is not much, but we expect that the restore consumes fewer fragments and more full blocks than the live-FS of the application (we can not compare, as the content of the live-FS changed a lot since we had the problem). We assume that e.g. the logs of the application are split over a lot of fragments instead of full blocks, due to small writes to the logs by the application. The restore should write all the data in big chunks, so our expectation is that the FS will use more full blocks and fewer fragments. Because of this we expect that the live-FS with this specific data mix could benefit from changing the optimization.

I also played around with the fragment size. The expectation was that it would only change what is reported in the output of df (reducing the reported available space for the same amount of data). Here is the result:

[Chart: reported free space for different fragment sizes.]

The difference between 1k (default) and 2k is not much. With 8k we would lose too much unused space. A fragment size of 4k looks acceptable to get a better monitoring status for this particular data mix.

Based upon this we will probably create a new FS with a fragment size of 4k and we will probably switch the optimization directly to "space". This way we will have a better reporting on the fill level of the FS for our data mix (but we will not be able to fully use the real space of the FS) and as such our monitoring should alert us in time to do a cleanup of the FS or to increase the size of the FS.
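On Solaris the corresponding newfs invocation would be roughly as follows (the device path is made up; double-check the options against newfs(1M) before using this on real data):

newfs -f 4096 -o space /dev/rdsk/c0t0d0s6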

ZFS and NFS / on-disk-cache

In the FreeBSD mailinglists I stumbled over a post which refers to a blog-post which describes why ZFS seems to be slow (on Solaris).

In short: ZFS guarantees that the NFS client does not experience silent corruption of data (NFS server crash and loss of data which is supposed to be already on disk for the client). A recommendation is to enable the disk-cache for disks which are completely used by ZFS, as ZFS (unlike UFS) is aware of disk-caches. This increases the performance to what UFS is delivering in the NFS case.

There is no in-depth description of what it means that ZFS is aware of disk-caches, but I think this is a reference to the fact that ZFS sends a flush command to the disk at the right moments. Leaving aside the fact that there are disks out there which lie to you about this (they report the flush command as finished when it is not), this would mean that this is supported in FreeBSD too.

So everyone who is currently disabling the ZIL to get better NFS performance (and accepting silent data corruption on the client side): move your zpool to dedicated (no other real FS than ZFS, swap and dump devices are OK) disks (honest ones) and enable the disk-caches instead of disabling the ZIL.

I also recommend that people who already have ZFS on dedicated (and honest) disks have a look at whether the disk-caches are enabled.
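On FreeBSD, how to check this depends on the driver in use; a sketch (the device name is an example, consult the respective man pages):

sysctl hw.ata.wc                 # ata(4) disks: 1 means the on-disk write cache is enabled
camcontrol modepage da0 -m 8     # CAM/SCSI disks: look at the WCE field (1 = enabled)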

Showing off some numbers…

At work we have some performance problems.

One application (not off-the-shelf software) is not performing well. The problem is that the design of the application is far from good (auto-commit is used, and because of this the Oracle DB is doing too many writes for what the application is supposed to do). While helping our DBAs in their performance analysis (the vendor of the application is telling us our hardware is not fast enough, and I had to provide some numbers to show that this is not the case and that they need to improve the software, as it does not comply with the performance requirements they got before developing the application) I noticed that the filesystem where the DB and the application are located (a ZFS, if someone is interested) is sometimes doing 1,200 write operations per second (to write about 100 MB). Yeah, that is a lot of IOPS our SAN is able to do! Unfortunately too expensive to buy for use at home. 🙁

Another application (nagios 3.0) was generating a lot of major faults (caused by a lot of fork()s for the checks). It is a SunFire V890, and the highest number of MF per second I have seen on this machine was about 27,000. It never went below 10,000. On average it was maybe somewhere between 15,000 and 20,000. My Solaris-Desktop (an Ultra 20) generates maybe several hundred MF if a lot is going on (most of the time it does not generate much). Nobody can say the V890 is not used… 🙂 Oh, yes, I suggested enabling the nagios config setting for large sites, now the major faults are around 0 to 10,000 and the machine is not that stressed anymore. The next step is probably to have a look at the ancient probes (migrated from the big brother setup which was there several years before) and reduce the number of forks they do.

Some fixes for ZFS on 7-stable (more testers wanted)

Due to the problems with a 7-stable machine, I had a look at some unmerged fixes for ZFS (58 changes not merged).

I backported some of those changes from 8-stable to 7-stable, and I have this running on one 7-stable machine. I would like to get some more feedback for it (even an "it works for me" would be great). The main part of this change is that the FreeBSD taskqueue is now used instead of the opensolaris one (plus some other changes which may improve the ZFS experience).

It would also be nice if someone could have a look at the FIRST_THREAD_IN_PROC part. Can there be more than one thread at this place (I do not think so), and should I use FOREACH_THREAD_IN_PROC instead?

How to apply:

  • cd /usr/src/
  • fetch http://www.Leidinger.net/FreeBSD/test/releng7_zfs_merge3.diff
  • fetch http://www.Leidinger.net/FreeBSD/test/opensolaris_taskq.c
  • fetch http://www.Leidinger.net/FreeBSD/test/taskq.h
  • mv taskq.h sys/cddl/contrib/opensolaris/uts/common/sys/taskq.h
  • mv opensolaris_taskq.c sys/cddl/compat/opensolaris/kern/opensolaris_taskq.c
  • patch -p 0 --quiet < releng7_zfs_merge3.diff
  • ignore the 2 .rej files
  • rm -f sys/cddl/compat/opensolaris/sys/taskq_impl.h*
  • rm -f sys/cddl/compat/opensolaris/sys/taskq.h*
  • rm -f sys/cddl/contrib/opensolaris/uts/common/os/taskq.c*
  • rebuild the kernel

I do not list here all 16 of the 58 outstanding patches which are covered by this diff; a detailed list can be found on the stable and fs mailinglists.