How I set up a Jail-Host

Everyone has his own way of setting up a machine to serve as a host of multiple jails. Here is my way, YMMV.

Initial FreeBSD install

I use several harddisks in a software-RAID setup. It does not matter much if you set them up with one big partition or with several partitions, feel free to follow your preferences here. My way of partitioning the harddisks is described in a previous post. That post only shows the commands to split the harddisks into two partitions and use ZFS for the rootfs. The commands to initialize the ZFS data partition are not described, but you should be able to figure it out yourself (and you can decide on your own what kind of RAID level you want to use). For this FS I set atime, exec and setuid to off in the ZFS options.

On the ZFS data partition I create a new dataset for the system. For this dataset I set atime, exec and setuid to off in the ZFS options. Inside this dataset I create datasets for /home, /usr/compat, /usr/local, /usr/obj, /usr/ports, /usr/src, /usr/sup and /var/ports. There are two ways of doing this. One way is to set the ZFS mountpoint. The way I prefer is to set relative symlinks to it, e.g. "cd /usr; ln -s ../data/system/usr_obj obj". I do this because this way I can temporarily import the pool on another machine (e.g. my desktop, if the need arises) without fear of interfering with the system. The ZFS options are set as follows:

ZFS options for data/system/*

Dataset                   Option            Value
data/system/home          exec              on
data/system/usr_compat    exec              on
data/system/usr_compat    setuid            on
data/system/usr_local     exec              on
data/system/usr_local     setuid            on
data/system/usr_obj       exec              on
data/system/usr_ports     exec              on
data/system/usr_ports     setuid            on
data/system/usr_src       exec              on
data/system/usr_sup       secondarycache    none
data/system/var_ports     exec              on

The exec option for home is not necessary if you keep separate datasets for each user. Normally I keep separate datasets for home directories, but Jail-Hosts should not have users (except the admins, but they should not keep data in their homes), so I just create a single home dataset. The setuid option for usr_ports should not be necessary if you redirect the build directory of the ports to a different place (WRKDIRPREFIX in /etc/make.conf).
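To make this concrete, here is a minimal sketch of how those datasets could be created, assuming the data pool is mounted at /data (the option values follow the table above; only part of the list is shown):

zfs create -o atime=off -o exec=off -o setuid=off data/system
zfs create -o exec=on data/system/home
zfs create -o exec=on -o setuid=on data/system/usr_local
zfs create -o exec=on -o setuid=on data/system/usr_ports
zfs create -o exec=on data/system/usr_obj
zfs create -o exec=on data/system/usr_src
zfs create -o secondarycache=none data/system/usr_sup
# ...likewise for usr_compat and var_ports, then the relative symlinks, e.g.:
cd /usr; ln -s ../data/system/usr_obj obj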

Installing ports

The ports I install by default are net/rsync, ports-mgmt/portaudit, ports-mgmt/portmaster, shells/zsh, sysutils/bsdstats, sysutils/ezjail, sysutils/smartmontools and sysutils/tmux.
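On a freshly installed host this could be bootstrapped roughly as follows (a sketch and my own choice of order, not part of the original setup; it assumes an up-to-date ports tree):

cd /usr/ports/ports-mgmt/portmaster && make install clean
portmaster net/rsync ports-mgmt/portaudit shells/zsh sysutils/bsdstats \
    sysutils/ezjail sysutils/smartmontools sysutils/tmux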

Basic setup

In the crontab of root I set up a job to do a portsnap update once a day (I pick a random number between 0 and 59 for the minute, but keep a fixed hour). I also have http_proxy specified in /etc/profile, so that all machines in this network do not download everything from far away again and again, but get the data from the local caching proxy (a small sketch of both settings follows below). As a little watchdog I have an @reboot rule in the crontab, which notifies me when a machine reboots:

@reboot grep "kernel boot file is" /var/log/messages | mail -s "`hostname` rebooted" root >/dev/null 2>&1

This does not replace a real monitoring solution, but in cases where real monitoring is overkill it provides a nice HEADS-UP (and shows you directly which kernel is loaded in case a non-default one is used).
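For reference, a sketch of the two settings mentioned above; the minute, the proxy host and the exact portsnap invocation are placeholders/assumptions, not taken from the original setup:

# root's crontab: daily ports tree update (random minute, fixed hour)
37 3 * * * /usr/sbin/portsnap cron update >/dev/null 2>&1

# /etc/profile: let fetch(1) and friends use the local caching proxy
http_proxy=http://proxy.example.net:3128/; export http_proxy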

Some default aliases I use everywhere are:

alias portmlist="portmaster -L | egrep -B1 '(ew|ort) version|Aborting|installed|dependencies|IGNORE|marked|Reason:|MOVED|deleted|exist|update' | grep -v '^--'"
alias portmclean="portmaster -t --clean-distfiles --clean-packages"
alias portmcheck="portmaster -y --check-depends"

Additional devfs rules for Jails

I have the need to give access to some specific devices in some jails. For this I need to set up a custom /etc/devfs.rules file. The file contains some ID numbers which need to be unique in the system. On a 9-current system the numbers one to four are already used (see /etc/defaults/devfs.rules), so the next available number is five. First I present my devfs.rules entries, then I explain them:

[devfsrules_unhide_audio=5]
add path 'audio*' unhide
add path 'dsp*' unhide
add path midistat unhide
add path 'mixer*' unhide
add path 'music*' unhide
add path 'sequencer*' unhide
add path sndstat unhide
add path speaker unhide

[devfsrules_unhide_printers=6]
add path 'lpt*' unhide
add path 'ulpt*' unhide user 193 group 193
add path 'unlpt*' unhide user 193 group 193

[devfsrules_unhide_zfs=7]
add path zfs unhide

[devfsrules_jail_printserver=8]
add include $devfsrules_hide_all
add include $devfsrules_unhide_basic
add include $devfsrules_unhide_login
add include $devfsrules_unhide_printers
add include $devfsrules_unhide_zfs

[devfsrules_jail_withzfs=9]
add include $devfsrules_hide_all
add include $devfsrules_unhide_basic
add include $devfsrules_unhide_login
add include $devfsrules_unhide_zfs

The devfsrules_unhide_XXX ones give access to specific devices, e.g. all the sound related devices or the local printers. The devfsrules_jail_XXX ones combine all the unhide rules for specific jail setups. Unfortunately the include directive is not recursive, so we can not include the default devfsrules_jail profile and need to replicate its contents; the first three includes of each devfsrules_jail_XXX accomplish this. The unhide_zfs rule gives access to /dev/zfs, which is needed if you attach one or more ZFS datasets to a jail. I will explain how to use those profiles with ezjail in a follow-up post.

Jails setup

I use ezjail to manage jails; it is more comfortable than doing everything by hand, while at the same time allowing me to do some things by hand. My jails normally reside inside ZFS datasets, for this reason I have set up a special area (the ZFS dataset data/jails) which is handled by ezjail. The corresponding ezjail.conf settings are:

ezjail_jaildir=/data/jails
ezjail_use_zfs="YES"
ezjail_jailzfs="data/jails"

I also disabled procfs and fdescfs in jails (but they can be enabled later for specific jails if necessary).
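If I recall ezjail's knobs correctly, this corresponds to the following ezjail.conf settings (a sketch; check the variable names against your ezjail.conf):

ezjail_procfs_enable="NO"
ezjail_fdescfs_enable="NO"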

Unfortunately ezjail (as of v3.1) sets the mountpoint of a newly created dataset even if it is not necessary. For this reason I always issue a "zfs inherit mountpoint <jail dataset>" after creating a jail. This simplifies the case where you want to move/rename a dataset and want the mountpoint to automatically follow the change.

The access flags of the /data/jails directory are 700; this prevents local users (there should be none, but better safe than sorry) from getting access to files of users in jails with the same UID.

After the first create/update of the ezjail basejail, the ZFS options of basejail (data/jails/basejail) and newjail (data/jails/newjail) need to be changed: for both, exec and setuid should be set to "on". The same needs to be done for each new jail after creating it (before starting it).
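Spelled out, this amounts to something like the following (a sketch; <jailname> stands for the dataset ezjail-admin create makes for a new jail):

chmod 700 /data/jails
zfs set exec=on data/jails/basejail
zfs set setuid=on data/jails/basejail
zfs set exec=on data/jails/newjail
zfs set setuid=on data/jails/newjail
# after "ezjail-admin create ...", before starting the new jail:
zfs set exec=on data/jails/<jailname>
zfs set setuid=on data/jails/<jailname>
zfs inherit mountpoint data/jails/<jailname>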

The default ezjail flavour

In my default ezjail flavour I create some default user(s) with a base-system shell (via /data/jails/flavours/mydef/ezjail.flavour) before the package install, and change the shell to my preferred zsh afterwards (this is only valid if the jails are used only by in-house people; if you want to offer lightweight virtual machines to (unknown) customers, the default user(s) and shell(s) are obviously open to discussion). At the end I also run a "/usr/local/sbin/portmaster -y --check-depends" to make sure everything is in a sane state.
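The relevant part of such a flavour script could look roughly like this; a sketch only, the user name "admin" and the exact pw invocations are my assumptions, and the package installation itself is handled by the rest of the flavour:

# hypothetical excerpt from /data/jails/flavours/mydef/ezjail.flavour
# create an admin user with a base-system shell (password locked with '*')
echo '*' | pw useradd -n admin -m -s /bin/tcsh -G wheel -H 0
# ... package installation from the flavour's pkg/ directory ...
# switch the shell to zsh once the package is installed
pw usermod -n admin -s /usr/local/bin/zsh
# verify that the registered dependencies are in a sane state
/usr/local/sbin/portmaster -y --check-depends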

For the packages (/data/jails/flavours/mydef/pkg/) I add symlinks to the unversioned packages I want to install. I keep the packages in a common directory (think about setting PACKAGES in make.conf and using PACKAGES/Latest/XYZ.tbz) if they can be shared over various flavours, and they are unversioned so that I do not have to update the version number each time there is an update. The packages I install by default are bsdstats, portaudit, portmaster, zsh, tmux and all their dependencies.

In case you use jails to virtualize services and consolidate servers (e.g. DNS, HTTP, MySQL each in a separate jail) instead of providing lightweight virtual machines to (unknown) customers, there is also a benefit in sharing the distfiles and packages between jails on the same machine. To do this I create /data/jails/flavours/mydef/shared/ports/{distfiles,packages}, which are then mounted via nullfs or NFS into all the jails from a common directory. This requires the following variables in /data/jails/flavours/mydef/etc/make.conf (I also keep the packages for different CPU types and compilers in the same subtree; if you do not care, just remove the "/${CC}/${CPUTYPE}" from the PACKAGES line):

DISTDIR=  /shared/ports/distfiles
PACKAGES= /shared/ports/packages/${CC}/${CPUTYPE}
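A matching nullfs mount could then look like the line below (a sketch: ezjail reads per-jail mounts from /etc/fstab.<jailname>, and the host-side location of the shared directory is an assumption of mine):

/data/shared/ports  /data/jails/<jailname>/shared/ports  nullfs  rw  0  0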

New jails

A future post will cover how I set up new jails in such a setup, and how I customize the start order of jails or use some non-default settings for the jail startup.

Another root-on-zfs HOWTO (optimized for 4k-sector drives)

After 9 years with my current home server (one jail for each service like MySQL, Squid, IMAP, Webmail, …) I decided that it is time to get something more recent (especially as I want to install some more jails but can not add more memory to this i386 system).

With my old system I had a UFS2 root on a 3-way gmirror, swap on a 2-way gmirror and my data in a 3-partition raidz (all in different slices of the same 3 harddisks; the 3rd slice which would correspond to the swap was used as a crashdump area).

For the new system I wanted to go all-ZFS, but I like to have my boot area separated from my data area (two pools instead of one big pool). As the machine has 12 GB RAM I also do not configure swap areas (at least by default; if I really need some swap I can add it later, see below). The system has five 1 TB harddisks and a 60 GB SSD. The harddisks do not have 4k sectors, but I expect that there will be more and more 4k-sector drives in the future. As I prefer to plan ahead, I installed the ZFS pools in a way that makes them "4k-ready". For those who have 4k-sector drives which do not tell the truth but announce 512 byte sectors (I will call them pseudo-4k-sector drives here) I include a description of how to properly align the (GPT) partitions.

A major requirement for booting from 4k-sector-size ZFS pools is ZFS v28 (to be precise, just the boot code needs to support this, so if you take the pmbr and gptzfsboot from a ZFS v28 system, this should work… but I have not tested this). As I am running 9-current, this is not an issue for me.

The quick description of the task is: align the partitions/slices properly for pseudo-4k-sector drives, and then use gnop temporarily at pool creation time to make ZFS use 4k sectors for the lifetime of the pool. The long description follows.

The layout of the drives

The five equal drives are partitioned with a GUID partition table (GPT). Each drive is divided into three partitions: one for the boot code, one for the root pool, and one for the data pool. The root pool is a 3-way mirror and the data pool is a raidz2 pool over all 5 disks. The corresponding space on the two harddisks which do not take part in the mirroring of the root pool gets swap partitions of the same size as the root partitions. One of them is used as a dump device (this is -current, after all), and the other one stays unused as a cold standby. The 60 GB SSD will be used as a ZFS cache device, but as of this writing I have not decided yet if I will use it for both pools or only for the data pool.

Calculating the offsets

The first sector after the GPT (created with standard settings) which can be used as the first sector of a partition is sector 34 on a 512-bytes-per-sector drive. On a pseudo-4k-sector drive this would be somewhere inside sector 4 of a real 4k-sector drive, so this is not a good starting point. The next 4k-aligned sector on a pseudo-4k-sector drive is sector 40 (sector 5 on a real 4k-sector drive).
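A quick check of that arithmetic (plain sh): a start LBA is 4k-aligned when its byte offset is a multiple of 4096:

# byte offset of a 512-byte LBA modulo 4096; 0 means 4k-aligned
echo $(( 34 * 512 % 4096 ))   # 1024 -> sector 34 is not aligned
echo $(( 40 * 512 % 4096 ))   # 0    -> sector 40 is aligned (start of 4k-sector 5)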

The first partition is the partition for the FreeBSD boot code. It needs to have enough space for gptzfsboot. Only allocating the space needed for gptzfsboot looks a little bit dangerous regarding future updates, so my harddisks are configured to allocate half a megabyte for it. Additionally I leave some unused sectors as a safety margin after this first partition.

The second partition is the root pool (respectively the swap partition). I let it start at sector 2048, which would be sector 256 on a real 4k-sector drive (if you do not want to waste the not-quite-half-a-megabyte in between, just calculate a lower start sector which is divisible by 8 (-> start % 8 = 0)). It is a 4 GB partition, which is enough for the base system with some debug kernels. Everything else (/usr/{src,ports,obj,local}) will be in the data partition.

The last partition starts directly after the second and uses the rest of the harddisk, rounded down to a full GB (if the disk needs to be replaced with a similarly sized disk there is some safety margin left, as the number of sectors in harddisks fluctuates a little bit even between the same models from the same manufacturing batch). For my harddisks this means a little bit more than half a gigabyte of wasted storage space.

The commands to partition the disks

In the following I use ada0 as the device of the disk, but it also works with daX or adX or similar. I installed one disk from an existing 9-current system instead of using some kind of installation media (beware, the pool is linked to the system which creates it; I booted a live USB image to import it on the new system and copied the zpool.cache to /boot/zfs/ after importing it there).

Create the GPT:

gpart create -s gpt ada0

Create the boot partition:

gpart add -b 40 -s 1024 -t freebsd-boot ada0

Create the root/swap partitions and name them with a GPT label:

gpart add -b 2048 -s 4G -t freebsd-zfs -l rpool0 ada0

or for the swap

gpart add -b 2048 -s 4G -t freebsd-swap -l swap0 ada0

Create the data partition and name it with a GPT label:

gpart add -s 927G -t freebsd-zfs -l data0 ada0

Install the boot code in partition 1:

gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0

The result looks like this:

# gpart show ada0
=>        34  1953525101  ada0  GPT  (931G)
          34           6        - free -  (3.0k)
          40        1024     1  freebsd-boot  (512k)
        1064         984        - free -  (492k)
        2048     8388608     2  freebsd-zfs  (4.0G)
     8390656  1944059904     3  freebsd-zfs  (927G)
  1952450560     1074575        - free -  (524M)
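The other drives are partitioned the same way; on the two drives which carry swap instead of a root-pool mirror member, the second partition simply gets the freebsd-swap type. A sketch for one of them (the device ada3 and the label names are my assumptions):

gpart create -s gpt ada3
gpart add -b 40 -s 1024 -t freebsd-boot ada3
gpart add -b 2048 -s 4G -t freebsd-swap -l swap0 ada3
gpart add -s 927G -t freebsd-zfs -l data3 ada3
# installing the boot code here as well is optional, but does not hurt
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada3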

Create the pools with 4k-ready internal structures

Creating a ZFS pool on one of the ZFS partitions without preparation will not create a 4k-ready pool on a pseudo-4k-drive. I used gnop (the settings do not survive a reboot) to make the partition temporarily a 4k-sector partition (only the commands for the root pool are shown; for the data partition gnop has to be used in the same way):

gnop create -S 4096 ada0p2
zpool create -O utf8only=on -o failmode=panic rpool ada0p2.nop
zpool export rpool
gnop destroy ada0p2.nop
zpool import rpool

After the pool is created, it will keep the 4k-sectors setting, even when accessed without gnop. You can ignore the options I used to create the pool, they are just my preferences (and the utf8only setting can only be done at pool creation time). If you prepare this on a system which already has a zpool of its own, you can maybe specify "-o cachefile=/boot/zfs/zpool2.cache" and copy it to the new pool as zpool.cache to make it bootable without the need of a live image for the new system (I did not test this).
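For the data pool the procedure is the same. A sketch (the pool name "data", the device names and the shortcut of gnop-ing only one member are my assumptions; ZFS uses the largest sector size within a vdev, so one 4k provider is enough to push that vdev's ashift to 12):

gnop create -S 4096 ada0p3
zpool create data raidz2 ada0p3.nop ada1p3 ada2p3 ada3p3 ada4p3
zpool export data
gnop destroy ada0p3.nop
zpool import data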

Verifying if a pool is pseudo-4k-ready

To verify that the pool will use 4k sectors, you can have a look at the ashift values of the pool (the ashift is per vdev, so if you e.g. concatenate several mirrors, the ashift needs to be verified for each mirror, and if you concatenate just a bunch of disks, the ashift needs to be verified for all disks). It needs to be 12. To get the ashift value you can use zdb:

zdb rpool | grep ashift
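On a 4k-ready pool this should print a line like the following for each vdev (the exact indentation depends on the zdb version):

            ashift: 12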

Setting up the root pool

One of the benefits of root-on-zfs is that I can have multiple FreeBSD boot environments (BE). This means that I can not only have several different kernels, but also several different userland versions. To handle them comfortably, I use manageBE from Philipp Wuensche. This requires a specific setup of the root pool:

zfs create rpool/ROOT
zfs create rpool/ROOT/r220832M
zpool set bootfs=rpool/ROOT/r220832M rpool
zfs set freebsd:boot-environment=1 rpool/ROOT/r220832M   # manageBE setting

The r220832M is my initial BE. I use the SVN revision of the source tree which was used during the install of this BE as the name of the BE. You also need to add the following line to /boot/loader.conf:

vfs.root.mountfrom="zfs:rpool/ROOT/r220832M"

As I want to have a shared /var and /tmp for all my BEs, I create them separately:

zfs create -o exec=off -o setuid=off -o mountpoint=/rpool/ROOT/r220832M/var rpool/var
zfs create -o setuid=off -o mountpoint=/rpool/ROOT/r220832M/tmp rpool/tmp

As I did this on the old system, I did not set the mountpoints to /var and /tmp yet; this has to be done later.
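The deferred step is then just the two commands below (a sketch; it matches the mountpoint change described further down, to be run once the datasets are unmounted before the move to the new system):

zfs set mountpoint=/var rpool/var
zfs set mountpoint=/tmp rpool/tmp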

Now the userland can be installed (e.g. buildworld/installworld/buildkernel/installkernel/mergemaster with DESTDIR=/rpool/ROOT/r220832M/; do not forget to put a good master.passwd/passwd/group into the root pool).

When the root pool is ready, make sure an empty /etc/fstab is inside, and configure the root as follows (only showing what is necessary for root-on-zfs):

loader.conf:
---snip---
vfs.root.mountfrom="zfs:rpool/ROOT/r220832M"
zfs_load="yes"
opensolaris_load="yes"
---snip---

rc.conf:
---snip---
zfs_enable="YES"
---snip---

At this point of the setup I unmounted all ZFS filesystems on rpool, set the mountpoint of rpool/var to /var and of rpool/tmp to /tmp, exported the pool and installed the harddisk in the new system. After booting a live USB image, importing the pool and putting the resulting zpool.cache into the pool (rpool/ROOT/r220832M/boot/zfs/), I rebooted into the rpool and attached the other harddisks to the pool ("zpool attach rpool ada0p2 ada1p2", "zpool attach rpool ada0p2 ada2p2").

After updating to a more recent version of 9-current, the BE list looks like this:

# ./bin/manageBE list
Poolname: rpool
BE                Active Active Mountpoint           Space
Name              Now    Reboot -                    Used
----              ------ ------ ----------           -----
r221295M          yes    yes    /                    2.66G
cannot open '-': dataset does not exist
r221295M@r221295M no     no     -
r220832M          no     no     /rpool/ROOT/r220832M  561M

Used by BE snapshots: 561M

The little bug above (the error message, probably caused by the snapshot which shows up here because I use listsnapshots=on) has already been reported to the author of manageBE.

Jumpstart/JET for FreeBSD (brainstorming)

There are some HOWTOs out there on the net which describe an automatic network-based install: a machine PXE-boots from a server which has a specific FreeBSD release in the PXE boot area plus a non-interactive config for sysinstall, and that FreeBSD version gets installed on the machine which PXE-boots from it.

The setup of this is completely manual and only allows netbooting one FreeBSD version. The server-side setup for the clients is also completely manual (and only allows installing one client at a time, it seems). This is not very user-friendly, and far away from the power of Jumpstart/JET for Solaris, where you create a template (maybe from another template, with automatic value (IP, name, MAC) replacement), can specify different OS releases for different clients, and then just run a command to generate a good config from this.

I thought a little bit about how it could be done and decided to write down all the details (so far 160 lines, 830 words) so as not to forget them. All in all I think this could be done (at least a sensible subset) in a week or two (fulltime) if you have the hardware, motivation, and time. As always, the problems are in the details, so I may be a little bit off with my estimate (it also depends upon the knowledge level (shell, tftp, dhcpd, install software) of the person doing this).

Unfortunately I do not know if I have the hardware at home to do something like this. I have some unused harddisks which could be used in a machine serving temporarily as a test install client (normally I use this machine as my desktop… if I do not use my little netbook instead, as I do not do much at home currently), but I have never checked if this machine is capable of PXE-booting (VIA KT133 chipset with a 3Com 3c905C-TX Fast Etherlink XL). I also do not have the time to do this (at the current rate of free time I would expect to need about a year), except maybe if someone would call my boss and negotiate something.

I can not remember any request for something like this on the freebsd-current, freebsd-arch or freebsd-hackers lists since I started reading them (and that is since at least about 3.0-RELEASE). Is this because nearly nobody is interested in something like this, or are the current possibilities enough for your needs? Do you work at a place where this would be welcome (= directly used once it was done)? If you use a simple solution for net installs, what is your experience with it (pros/cons)?

ZFS & power-failure: stable

At the weekend there was a power failure at our disaster-recovery site. As everything should be connected to the UPS, this should not have had an impact… unfortunately the guys responsible for the cabling seem not to have provided enough power connections from the UPS. Result: one of our storage systems (all volumes in several RAID5 virtual disks) for the test systems lost power, and 10 harddisks switched into failed state when the power was stable again (I was told there were several small power failures that day). After telling the software to have a look at the drives again, all physical disks were accepted.

All volumes on one of the virtual disks were damaged beyond repair (actually, the virtual disk itself was damaged) and we had to recover from backup.

All ZFS-based mountpoints on the good virtual disks did not show bad behavior (zpool clear + zpool scrub for those which showed checksum errors, to make us feel better). For the UFS-based ones… some caused a panic after reboot and we had to run fsck on them before trying a second boot.

We spent a lot more time getting UFS back online than getting ZFS back online. After this experience it looks like our future Solaris 10u8 installs will be with root on ZFS (our workstations are already like this, but our servers are still at Solaris 10u6).