How I set up a Jail-Host

Everyone has his own way of setting up a machine to serve as a host of multiple jails. Here is my way, YMMV.

Initial FreeBSD install

I use several harddisks in a software-RAID setup. It does not matter much if you set them up with one big partition or with several partitions, feel free to follow your preferences here. My way of partitioning the harddisks is described in a previous post. That post only shows the commands to split the harddisks into two partitions and use ZFS for the rootfs. The commands to initialize the ZFS data partition are not described, but you should be able to figure it out yourself (and you can decide on your own what kind of RAID level you want to use). For this FS I set atime, exec and setuid to off in the ZFS options.

On the ZFS data partition I create a new dataset for the system. For this dataset I set atime, exec and setuid to off in the ZFS options. Inside this dataset I create datasets for /home, /usr/compat, /usr/local, /usr/obj, /usr/ports, /usr/src, /usr/sup and /var/ports. There are two ways of doing this. One way is to set the ZFS mountpoint. The way I prefer is to set relative symlinks to it, e.g. "cd /usr; ln -s ../data/system/usr_obj obj". I do this because this way I can temporarily import the pool on another machine (e.g. my desktop, if the need arises) without fear of interfering with the system. The ZFS options are set as follows:

ZFS options for data/system/*

Dataset                   Option            Value
data/system/home          exec              on
data/system/usr_compat    exec              on
data/system/usr_compat    setuid            on
data/system/usr_local     exec              on
data/system/usr_local     setuid            on
data/system/usr_obj       exec              on
data/system/usr_ports     exec              on
data/system/usr_ports     setuid            on
data/system/usr_src       exec              on
data/system/usr_sup       secondarycache    none
data/system/var_ports     exec              on

The exec option for home is not necessary if you keep separate datasets for each user. Normally I keep separate datasets for home directories, but Jail-Hosts should not have users (except the admins, but they should not keep data in their homes), so I just create a single home dataset. The setuid option for usr_ports should not be necessary if you redirect the build directory of the ports to a different place (WRKDIRPREFIX in /etc/make.conf).
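To make this concrete, here is a minimal sketch of how those datasets could be created, assuming the data pool is mounted at /data (the option values follow the table above; only part of the list is shown):

zfs create -o atime=off -o exec=off -o setuid=off data/system
zfs create -o exec=on data/system/home
zfs create -o exec=on -o setuid=on data/system/usr_local
zfs create -o exec=on -o setuid=on data/system/usr_ports
zfs create -o exec=on data/system/usr_obj
zfs create -o exec=on data/system/usr_src
zfs create -o secondarycache=none data/system/usr_sup
# ...likewise for usr_compat and var_ports, then the relative symlinks, e.g.:
cd /usr; ln -s ../data/system/usr_obj obj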

Installing ports

The ports I install by default are net/rsync, ports-mgmt/portaudit, ports-mgmt/portmaster, shells/zsh, sysutils/bsdstats, sysutils/ezjail, sysutils/smartmontools and sysutils/tmux.
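On a freshly installed host this could be bootstrapped roughly as follows (a sketch and my own choice of order, not part of the original setup; it assumes an up-to-date ports tree):

cd /usr/ports/ports-mgmt/portmaster && make install clean
portmaster net/rsync ports-mgmt/portaudit shells/zsh sysutils/bsdstats \
    sysutils/ezjail sysutils/smartmontools sysutils/tmux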

Basic setup

In the crontab of root I set up a job to do a portsnap update once a day (I pick a random number between 0 and 59 for the minute, but keep a fixed hour). I also have http_proxy specified in /etc/profile, so that all machines in this network do not download everything from far away again and again, but get the data from the local caching proxy (a small sketch of both settings follows below). As a little watchdog I have an @reboot rule in the crontab, which notifies me when a machine reboots:

@reboot grep "kernel boot file is" /var/log/messages | mail -s "`hostname` rebooted" root >/dev/null 2>&1

This does not replace a real monitoring solution, but in cases where real monitoring is overkill it provides a nice HEADS-UP (and shows you directly which kernel is loaded in case a non-default one is used).
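For reference, a sketch of the two settings mentioned above; the minute, the proxy host and the exact portsnap invocation are placeholders/assumptions, not taken from the original setup:

# root's crontab: daily ports tree update (random minute, fixed hour)
37 3 * * * /usr/sbin/portsnap cron update >/dev/null 2>&1

# /etc/profile: let fetch(1) and friends use the local caching proxy
http_proxy=http://proxy.example.net:3128/; export http_proxy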

Some default aliases I use everywhere are:

alias portmlist="portmaster -L | egrep -B1 '(ew|ort) version|Aborting|installed|dependencies|IGNORE|marked|Reason:|MOVED|deleted|exist|update' | grep -v '^--'"
alias portmclean="portmaster -t --clean-distfiles --clean-packages"
alias portmcheck="portmaster -y --check-depends"

Additional devfs rules for Jails

I have the need to give access to some specific devices in some jails. For this I need to set up a custom /etc/devfs.rules file. The file contains some ID numbers which need to be unique in the system. On a 9-current system the numbers one to four are already used (see /etc/defaults/devfs.rules), so the next available number is five. First I present my devfs.rules entries, then I explain them:

[devfsrules_unhide_audio=5]
add path 'audio*' unhide
add path 'dsp*' unhide
add path midistat unhide
add path 'mixer*' unhide
add path 'music*' unhide
add path 'sequencer*' unhide
add path sndstat unhide
add path speaker unhide

[devfsrules_unhide_printers=6]
add path 'lpt*' unhide
add path 'ulpt*' unhide user 193 group 193
add path 'unlpt*' unhide user 193 group 193

[devfsrules_unhide_zfs=7]
add path zfs unhide

[devfsrules_jail_printserver=8]
add include $devfsrules_hide_all
add include $devfsrules_unhide_basic
add include $devfsrules_unhide_login
add include $devfsrules_unhide_printers
add include $devfsrules_unhide_zfs

[devfsrules_jail_withzfs=9]
add include $devfsrules_hide_all
add include $devfsrules_unhide_basic
add include $devfsrules_unhide_login
add include $devfsrules_unhide_zfs

The devfsrules_unhide_XXX ones give access to specific devices, e.g. all the sound related devices or the local printers. The devfsrules_jail_XXX ones combine all the unhide rules for specific jail setups. Unfortunately the include directive is not recursive, so we can not include the default devfsrules_jail profile and need to replicate its contents; the first three includes of each devfsrules_jail_XXX accomplish this. The unhide_zfs rule gives access to /dev/zfs, which is needed if you attach one or more ZFS datasets to a jail. I will explain how to use those profiles with ezjail in a follow-up post.

Jails setup

I use ezjail to manage jails; it is more comfortable than doing everything by hand, while at the same time allowing me to do some things by hand. My jails normally reside inside ZFS datasets, for this reason I have set up a special area (the ZFS dataset data/jails) which is handled by ezjail. The corresponding ezjail.conf settings are:

ezjail_jaildir=/data/jails
ezjail_use_zfs="YES"
ezjail_jailzfs="data/jails"

I also disabled procfs and fdescfs in jails (but they can be enabled later for specific jails if necessary).
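If I recall ezjail's knobs correctly, this corresponds to the following ezjail.conf settings (a sketch; check the variable names against your ezjail.conf):

ezjail_procfs_enable="NO"
ezjail_fdescfs_enable="NO"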

Unfortunately ezjail (as of v3.1) sets the mountpoint of a newly created dataset even if it is not necessary. For this reason I always issue a "zfs inherit mountpoint <jail dataset>" after creating a jail. This simplifies the case where you want to move/rename a dataset and want the mountpoint to automatically follow the change.

The access flags of the /data/jails directory are 700; this prevents local users (there should be none, but better safe than sorry) from getting access to files of users in jails with the same UID.

After the first create/update of the ezjail basejail, the ZFS options of basejail (data/jails/basejail) and newjail (data/jails/newjail) need to be changed: for both, exec and setuid should be set to "on". The same needs to be done for each new jail after creating it (before starting it).
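Spelled out, this amounts to something like the following (a sketch; <jailname> stands for the dataset ezjail-admin create makes for a new jail):

chmod 700 /data/jails
zfs set exec=on data/jails/basejail
zfs set setuid=on data/jails/basejail
zfs set exec=on data/jails/newjail
zfs set setuid=on data/jails/newjail
# after "ezjail-admin create ...", before starting the new jail:
zfs set exec=on data/jails/<jailname>
zfs set setuid=on data/jails/<jailname>
zfs inherit mountpoint data/jails/<jailname>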

The default ezjail flavour

In my default ezjail flavour I create some default user(s) with a base-system shell (via /data/jails/flavours/mydef/ezjail.flavour) before the package install, and change the shell to my preferred zsh afterwards (this is only valid if the jails are used only by in-house people; if you want to offer lightweight virtual machines to (unknown) customers, the default user(s) and shell(s) are obviously open to discussion). At the end I also run a "/usr/local/sbin/portmaster -y --check-depends" to make sure everything is in a sane state.
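The relevant part of such a flavour script could look roughly like this; a sketch only, the user name "admin" and the exact pw invocations are my assumptions, and the package installation itself is handled by the rest of the flavour:

# hypothetical excerpt from /data/jails/flavours/mydef/ezjail.flavour
# create an admin user with a base-system shell (password locked with '*')
echo '*' | pw useradd -n admin -m -s /bin/tcsh -G wheel -H 0
# ... package installation from the flavour's pkg/ directory ...
# switch the shell to zsh once the package is installed
pw usermod -n admin -s /usr/local/bin/zsh
# verify that the registered dependencies are in a sane state
/usr/local/sbin/portmaster -y --check-depends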

For the packages (/data/jails/flavours/mydef/pkg/) I add symlinks to the unversioned packages I want to install. I keep the packages in a common directory (think about setting PACKAGES in make.conf and using PACKAGES/Latest/XYZ.tbz) if they can be shared over various flavours, and they are unversioned so that I do not have to update the version number each time there is an update. The packages I install by default are bsdstats, portaudit, portmaster, zsh, tmux and all their dependencies.

In case you use jails to virtualize services and consolidate servers (e.g. DNS, HTTP, MySQL each in a separate jail) instead of providing lightweight virtual machines to (unknown) customers, there is also a benefit in sharing the distfiles and packages between jails on the same machine. To do this I create /data/jails/flavours/mydef/shared/ports/{distfiles,packages}, which are then mounted via nullfs or NFS into all the jails from a common directory. This requires the following variables in /data/jails/flavours/mydef/etc/make.conf (I also keep the packages for different CPU types and compilers in the same subtree; if you do not care, just remove the "/${CC}/${CPUTYPE}" from the PACKAGES line):

DISTDIR=  /shared/ports/distfiles
PACKAGES= /shared/ports/packages/${CC}/${CPUTYPE}
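A matching nullfs mount could then look like the line below (a sketch: ezjail reads per-jail mounts from /etc/fstab.<jailname>, and the host-side location of the shared directory is an assumption of mine):

/data/shared/ports  /data/jails/<jailname>/shared/ports  nullfs  rw  0  0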

New jails

A future post will cover how I set up new jails in such a setup, and how I customize the start order of jails or use some non-default settings for the jail startup.

Another root-on-zfs HOWTO (optimized for 4k-sector drives)

After 9 years with my current home server (one jail for each service like MySQL, Squid, IMAP, Webmail, …) I decided that it is time to get something more recent (especially as I want to install some more jails but can not add more memory to this i386 system).

With my old system I had a UFS2 root on a 3-way gmirror, swap on a 2-way gmirror and my data in a 3-partition raidz (all in different slices of the same 3 harddisks; the 3rd slice which would correspond to the swap was used as a crashdump area).

For the new system I wanted to go all-ZFS, but I like to have my boot area separated from my data area (two pools instead of one big pool). As the machine has 12 GB RAM I also do not configure swap areas (at least by default; if I really need some swap I can add it later, see below). The system has five 1 TB harddisks and a 60 GB SSD. The harddisks do not have 4k sectors, but I expect that there will be more and more 4k-sector drives in the future. As I prefer to plan ahead, I installed the ZFS pools in a way that makes them "4k-ready". For those who have 4k-sector drives which do not tell the truth but announce 512 byte sectors (I will call them pseudo-4k-sector drives here) I include a description of how to properly align the (GPT) partitions.

A major requirement for booting from 4k-sector-size ZFS pools is ZFS v28 (to be precise, just the boot code needs to support this, so if you take the pmbr and gptzfsboot from a ZFS v28 system, this should work… but I have not tested this). As I am running 9-current, this is not an issue for me.

The quick description of the task is: align the partitions/slices properly for pseudo-4k-sector drives, and then use gnop temporarily at pool creation time to make ZFS use 4k sectors for the lifetime of the pool. The long description follows.

The layout of the drives

The five equal drives are partitioned with a GUID partition table (GPT). Each drive is divided into three partitions: one for the boot code, one for the root pool, and one for the data pool. The root pool is a 3-way mirror and the data pool is a raidz2 pool over all 5 disks. The corresponding space on the two harddisks which do not take part in the mirroring of the root pool gets swap partitions of the same size as the root partitions. One of them is used as a dump device (this is -current, after all), and the other one stays unused as a cold standby. The 60 GB SSD will be used as a ZFS cache device, but as of this writing I have not decided yet if I will use it for both pools or only for the data pool.

Calculating the offsets

The first sector after the GPT (created with standard settings) which can be used as the first sector of a partition is sector 34 on a 512-bytes-per-sector drive. On a pseudo-4k-sector drive this would be somewhere inside sector 4 of a real 4k-sector drive, so this is not a good starting point. The next 4k-aligned sector on a pseudo-4k-sector drive is sector 40 (sector 5 on a real 4k-sector drive).
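A quick check of that arithmetic (plain sh): a start LBA is 4k-aligned when its byte offset is a multiple of 4096:

# byte offset of a 512-byte LBA modulo 4096; 0 means 4k-aligned
echo $(( 34 * 512 % 4096 ))   # 1024 -> sector 34 is not aligned
echo $(( 40 * 512 % 4096 ))   # 0    -> sector 40 is aligned (start of 4k-sector 5)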

The first partition is the partition for the FreeBSD boot code. It needs to have enough space for gptzfsboot. Only allocating the space needed for gptzfsboot looks a little bit dangerous regarding future updates, so my harddisks are configured to allocate half a megabyte for it. Additionally I leave some unused sectors as a safety margin after this first partition.

The second partition is the root pool (respectively the swap partition). I let it start at sector 2048, which would be sector 256 on a real 4k-sector drive (if you do not want to waste the not-quite-half-a-megabyte in between, just calculate a lower start sector which is divisible by 8 (-> start % 8 = 0)). It is a 4 GB partition, which is enough for the base system with some debug kernels. Everything else (/usr/{src,ports,obj,local}) will be in the data partition.

The last partition starts directly after the second and uses the rest of the harddisk, rounded down to a full GB (if the disk needs to be replaced with a similarly sized disk there is some safety margin left, as the number of sectors in harddisks fluctuates a little bit even between the same models from the same manufacturing batch). For my harddisks this means a little bit more than half a gigabyte of wasted storage space.

The commands to partition the disks

In the following I use ada0 as the device of the disk, but it also works with daX or adX or similar. I installed one disk from an existing 9-current system instead of using some kind of installation media (beware, the pool is linked to the system which creates it; I booted a live USB image to import it on the new system and copied the zpool.cache to /boot/zfs/ after importing it there).

Create the GPT:

gpart create -s gpt ada0

Create the boot partition:

gpart add -b 40 -s 1024 -t freebsd-boot ada0

Create the root/swap partitions and name them with a GPT label:

gpart add -b 2048 -s 4G -t freebsd-zfs -l rpool0 ada0

or for the swap

gpart add -b 2048 -s 4G -t freebsd-swap -l swap0 ada0

Create the data partition and name it with a GPT label:

gpart add -s 927G -t freebsd-zfs -l data0 ada0

Install the boot code in partition 1:

gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0

The result looks like this:

# gpart show ada0
=>        34  1953525101  ada0  GPT  (931G)
          34           6        - free -  (3.0k)
          40        1024     1  freebsd-boot  (512k)
        1064         984        - free -  (492k)
        2048     8388608     2  freebsd-zfs  (4.0G)
     8390656  1944059904     3  freebsd-zfs  (927G)
  1952450560     1074575        - free -  (524M)
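The other drives are partitioned the same way; on the two drives which carry swap instead of a root-pool mirror member, the second partition simply gets the freebsd-swap type. A sketch for one of them (the device ada3 and the label names are my assumptions):

gpart create -s gpt ada3
gpart add -b 40 -s 1024 -t freebsd-boot ada3
gpart add -b 2048 -s 4G -t freebsd-swap -l swap0 ada3
gpart add -s 927G -t freebsd-zfs -l data3 ada3
# installing the boot code here as well is optional, but does not hurt
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada3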

Create the pools with 4k-ready internal structures

Creating a ZFS pool on one of the ZFS partitions without preparation will not create a 4k-ready pool on a pseudo-4k-drive. I used gnop (the settings do not survive a reboot) to make the partition temporarily a 4k-sector partition (only the commands for the root pool are shown; for the data partition gnop has to be used in the same way):

gnop create -S 4096 ada0p2
zpool create -O utf8only=on -o failmode=panic rpool ada0p2.nop
zpool export rpool
gnop destroy ada0p2.nop
zpool import rpool

After the pool is created, it will keep the 4k-sectors setting, even when accessed without gnop. You can ignore the options I used to create the pool, they are just my preferences (and the utf8only setting can only be done at pool creation time). If you prepare this on a system which already has a zpool of its own, you can maybe specify "-o cachefile=/boot/zfs/zpool2.cache" and copy it to the new pool as zpool.cache to make it bootable without the need of a live image for the new system (I did not test this).
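For the data pool the procedure is the same. A sketch (the pool name "data", the device names and the shortcut of gnop-ing only one member are my assumptions; ZFS uses the largest sector size within a vdev, so one 4k provider is enough to push that vdev's ashift to 12):

gnop create -S 4096 ada0p3
zpool create data raidz2 ada0p3.nop ada1p3 ada2p3 ada3p3 ada4p3
zpool export data
gnop destroy ada0p3.nop
zpool import data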

Verifying if a pool is pseudo-4k-ready

To verify that the pool will use 4k sectors, you can have a look at the ashift values of the pool (the ashift is per vdev, so if you e.g. concatenate several mirrors, the ashift needs to be verified for each mirror, and if you concatenate just a bunch of disks, the ashift needs to be verified for all disks). It needs to be 12. To get the ashift value you can use zdb:

zdb rpool | grep ashift
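On a 4k-ready pool this should print a line like the following for each vdev (the exact indentation depends on the zdb version):

            ashift: 12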

Setting up the root pool

One of the benefits of root-on-zfs is that I can have multiple FreeBSD boot environments (BE). This means that I can not only have several different kernels, but also several different userland versions. To handle them comfortably, I use manageBE from Philipp Wuensche. This requires a specific setup of the root pool:

zfs create rpool/ROOT
zfs create rpool/ROOT/r220832M
zpool set bootfs=rpool/ROOT/r220832M rpool
zfs set freebsd:boot-environment=1 rpool/ROOT/r220832M   # manageBE setting

The r220832M is my initial BE. I use the SVN revision of the source tree which was used during the install of this BE as the name of the BE. You also need to add the following line to /boot/loader.conf:

vfs.root.mountfrom="zfs:rpool/ROOT/r220832M"

As I want to have a shared /var and /tmp for all my BEs, I create them separately:

zfs create -o exec=off -o setuid=off -o mountpoint=/rpool/ROOT/r220832M/var rpool/var
zfs create -o setuid=off -o mountpoint=/rpool/ROOT/r220832M/tmp rpool/tmp

As I did this on the old system, I did not set the mountpoints to /var and /tmp yet; this has to be done later.
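The deferred step is then just the two commands below (a sketch; it matches the mountpoint change described further down, to be run once the datasets are unmounted before the move to the new system):

zfs set mountpoint=/var rpool/var
zfs set mountpoint=/tmp rpool/tmp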

Now the userland can be installed (e.g. buildworld/installworld/buildkernel/installkernel/mergemaster with DESTDIR=/rpool/ROOT/r220832M/; do not forget to put a good master.passwd/passwd/group into the root pool).

When the root pool is ready, make sure an empty /etc/fstab is inside, and configure the root as follows (only showing what is necessary for root-on-zfs):

loader.conf:
---snip---
vfs.root.mountfrom="zfs:rpool/ROOT/r220832M"
zfs_load="yes"
opensolaris_load="yes"
---snip---

rc.conf:
---snip---
zfs_enable="YES"
---snip---

At this point of the setup I unmounted all ZFS filesystems on rpool, set the mountpoint of rpool/var to /var and of rpool/tmp to /tmp, exported the pool and installed the harddisk in the new system. After booting a live USB image, importing the pool and putting the resulting zpool.cache into the pool (rpool/ROOT/r220832M/boot/zfs/), I rebooted into the rpool and attached the other harddisks to the pool ("zpool attach rpool ada0p2 ada1p2", "zpool attach rpool ada0p2 ada2p2").

After updating to a more recent version of 9-current, the BE list looks like this:

# ./bin/manageBE list
Poolname: rpool
BE                Active Active Mountpoint           Space
Name              Now    Reboot -                    Used
----              ------ ------ ----------           -----
r221295M          yes    yes    /                    2.66G
cannot open '-': dataset does not exist
r221295M@r221295M no     no     -
r220832M          no     no     /rpool/ROOT/r220832M  561M

Used by BE snapshots: 561M

The little bug above (the error message, probably caused by the snapshot which shows up here because I use listsnapshots=on) has already been reported to the author of manageBE.

Jumpstart/JET for FreeBSD (brainstorming)

There are some HOWTOs out there on the net which describe an automatic network-based install: a machine PXE-boots from a server which has a specific FreeBSD release in the PXE boot area plus a non-interactive config for sysinstall, and that FreeBSD version gets installed on the machine which PXE-boots from it.

The setup of this is completely manual and only allows netbooting one FreeBSD version. The server-side setup for the clients is also completely manual (and only allows installing one client at a time, it seems). This is not very user-friendly, and far away from the power of Jumpstart/JET for Solaris, where you create a template (maybe from another template, with automatic value (IP, name, MAC) replacement), can specify different OS releases for different clients, and then just run a command to generate a good config from this.

I thought a little bit about how it could be done and decided to write down all the details (so far 160 lines, 830 words) so as not to forget them. All in all I think this could be done (at least a sensible subset) in a week or two (fulltime) if you have the hardware, motivation, and time. As always, the problems are in the details, so I may be a little bit off with my estimate (it also depends upon the knowledge level (shell, tftp, dhcpd, install software) of the person doing this).

Unfortunately I do not know if I have the hardware at home to do something like this. I have some unused harddisks which could be used in a machine serving temporarily as a test install client (normally I use this machine as my desktop… if I do not use my little netbook instead, as I do not do much at home currently), but I have never checked if this machine is capable of PXE-booting (VIA KT133 chipset with a 3Com 3c905C-TX Fast Etherlink XL). I also do not have the time to do this (at the current rate of free time I would expect to need about a year), except maybe if someone would call my boss and negotiate something.

I can not remember any request for something like this on the freebsd-current, freebsd-arch or freebsd-hackers lists since I started reading them (and that is since at least about 3.0-RELEASE). Is this because nearly nobody is interested in something like this, or are the current possibilities enough for your needs? Do you work at a place where this would be welcome (= directly used once it was done)? If you use a simple solution for net installs, what is your experience with it (pros/cons)?

ZFS & power-failure: stable

At the weekend there was a power failure at our disaster-recovery site. As everything should be connected to the UPS, this should not have had an impact… unfortunately the guys responsible for the cabling seem not to have provided enough power connections from the UPS. Result: one of our storage systems (all volumes in several RAID5 virtual disks) for the test systems lost power, and 10 harddisks switched into failed state when the power was stable again (I was told there were several small power failures that day). After telling the software to have a look at the drives again, all physical disks were accepted.

All volumes on one of the virtual disks were damaged beyond repair (actually, the virtual disk itself was damaged) and we had to recover from backup.

All ZFS-based mountpoints on the good virtual disks did not show bad behavior (zpool clear + zpool scrub for those which showed checksum errors, to make us feel better). For the UFS-based ones… some caused a panic after reboot and we had to run fsck on them before trying a second boot.

We spent a lot more time getting UFS back online than getting ZFS back online. After this experience it looks like our future Solaris 10u8 installs will be with root on ZFS (our workstations are already like this, but our servers are still at Solaris 10u6).