Another root-on-zfs HOWTO (optimized for 4k-sector drives)

After 9 years with my current home-server (one jail for each service like MySQL, Squid, IMAP, Webmail, …) I decided that it is time to get something more recent (especially as I want to install some more jails but can not add more memory to this i386 system).

With my old system I had a UFS2 root on a 3-way gmirror, swap on a 2-way gmirror and my data in a 3-partition raidz (all in different slices of the same 3 harddisks; the 3rd slice which would correspond to the swap was used as a crashdump area).

For the new system I wanted to go all-ZFS, but I like to have my boot area separated from my data area (two pools instead of one big pool). As the machine has 12 GB RAM I also do not configure swap areas (at least by default; if I really need some swap I can add some later, see below). The system has five 1 TB harddisks and a 60 GB SSD. The harddisks do not have 4k sectors, but I expect that there will be more and more 4k-sector drives in the future. As I prefer to plan ahead, I installed the ZFS pools in a way that they are “4k-ready”. For those who have 4k-sector drives which do not tell the truth but announce 512-byte sectors (I will call them pseudo-4k-sector drives here), I include a description of how to properly align the (GPT) partitions.

A major requirement for booting from a ZFS pool with 4k sectors is ZFS v28 (to be correct here: just the boot code needs to support this, so if you take the pmbr and gptzfsboot from a ZFS v28 system, this should work… but I have not tested this). As I am running 9-current, this is not an issue for me.

A quick description of the task: align the partitions/slices properly for pseudo-4k-sector drives, then use gnop temporarily at pool creation time to have ZFS use 4k sectors during the lifetime of the pool. The long description follows.

The layout of the drives

The five equal drives are partitioned with a GUID partition table (GPT). Each drive is divided into three partitions: one for the boot code, one for the root pool, and one for the data pool. The root pool is a 3-way mirror and the data pool is a raidz2 pool over all 5 disks. The two harddisks which do not take part in the mirroring of the root pool get swap partitions of the same size as the root partitions in that place instead. One of them is used as a dump device (this is -current, after all), and the other one stays unused as a cold standby. The 60 GB SSD will be used as a ZFS cache device, but as of this writing I have not decided yet if I will use it for both pools or only for the data pool.

Calculating the offsets

The first sector after the GPT (created with standard settings) which can be used as the first sector of a partition is sector 34 on a 512-bytes-per-sector drive. On a pseudo-4k-sector drive this would fall somewhere inside sector 4 of a real 4k-sector drive, so this is not a good starting point. The next 4k-aligned sector on a pseudo-4k-sector drive is sector 40 (sector 5 on a real 4k-sector drive).
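
In numbers: sector 34 starts at byte 34 × 512 = 17408, and 17408 / 4096 = 4.25, i.e. in the middle of a native 4k sector. Sector 40 starts at byte 40 × 512 = 20480 = 5 × 4096, so it sits exactly on a 4k boundary (as does any start sector divisible by 8).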

The first partition is the partition for the FreeBSD boot code. It needs enough space for gptzfsboot. Allocating only exactly the space gptzfsboot currently needs looks a little bit dangerous regarding future updates, so my harddisks are configured to allocate half a megabyte (1024 sectors) for it. Additionally I leave some unused sectors as a safety margin after this first partition.

The second partition is the root pool (respectively the swap partition on the disks not in the root mirror). I let it start at sector 2048, which would be sector 256 on a real 4k-sector drive (if you do not want to waste a bit less than half a megabyte here, just calculate a lower start sector which is divisible by 8 (start % 8 = 0)). It is a 4 GB partition, which is enough for the base system with some debug kernels. Everything else (/usr/{src,ports,obj,local}) will be in the data partition.
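
As a concrete check: 2048 × 512 bytes = 1 MiB = 256 × 4096 bytes. And if you wanted to avoid the gap entirely: the first free sector after the boot partition in the layout below is 1064, and 1064 % 8 = 0, so it would be 4k-aligned as well.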

The last partition starts directly after the second and uses the rest of the harddisk, rounded down to a full GB (if the disk needs to be replaced with a similar-sized disk there is some safety margin left, as the number of sectors in harddisks fluctuates a little bit even among the same models from the same manufacturing batch). For my harddisks this means a little bit more than half a gigabyte of wasted storage space.

The commands to partition the disks

In the following I use ada0 as the device of the disk, but it also works with daX or adX or similar. I prepared one disk on an existing 9-current system instead of using some kind of installation media (beware: the pool is linked to the system which creates it; I booted a live-USB image to import it on the new system and copied the zpool.cache to /boot/zfs/ after importing on the new system).

Create the GPT:

gpart create -s gpt ada0

Create the boot partition:

gpart add -b 40 -s 1024 -t freebsd-boot ada0

Create the root/swap partitions and name them with a GPT label:

gpart add -b 2048 -s 4G -t freebsd-zfs -l rpool0 ada0

or, for the swap:

gpart add -b 2048 -s 4G -t freebsd-swap -l swap0 ada0

Create the data partition and name it with a GPT label:

gpart add -s 927G -t freebsd-zfs -l data0 ada0

Install the boot code in partition 1:

gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0

The result looks like this:

# gpart show ada0
=>        34  1953525101  ada0  GPT  (931G)
          34           6        - free -  (3.0k)
          40        1024     1  freebsd-boot  (512k)
        1064         984        - free -  (492k)
        2048     8388608     2  freebsd-zfs  (4.0G)
     8390656  1944059904     3  freebsd-zfs  (927G)
  1952450560     1074575        - free -  (524M)
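
If swap is needed later, it can be activated at runtime via the GPT labels created above, and the same device can be configured as the crashdump area. A minimal sketch, assuming the label swap0 from above:

swapon /dev/gpt/swap0
dumpon /dev/gpt/swap0

To make this permanent, add a corresponding swap line to /etc/fstab and set dumpdev="/dev/gpt/swap0" in /etc/rc.conf.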

Create the pools with 4k-ready internal structures

Creating a ZFS pool on one of the ZFS partitions without preparation will not create a 4k-ready pool on a pseudo-4k-drive. I used gnop (the settings do not survive a reboot) to temporarily turn the partition into a 4k-sector provider (only the commands for the root pool are shown; for the data partition gnop has to be used in the same way, see the sketch below):

gnop create -S 4096 ada0p2
zpool create -O utf8only=on -o failmode=panic rpool ada0p2.nop
zpool export rpool
gnop destroy ada0p2.nop
zpool import rpool

After the pool is created, it will keep the 4k-sector setting, even when accessed without gnop. You can ignore the options I used to create the pool, they are just my preferences (and the utf8only setting can only be done at pool creation time). If you prepare this on a system which already has a zpool of its own, you can maybe specify “-o cachefile=/boot/zfs/zpool2.cache” and copy it to the new pool as zpool.cache to make it bootable without the need for a live image for the new system (I did not test this).
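
For the data pool the sequence could look like this; a sketch, assuming the data partitions are ada0p3 through ada4p3 and the pool is named data (one gnop provider per vdev should be enough, as the largest sector size within a vdev determines its ashift):

gnop create -S 4096 ada0p3
zpool create data raidz2 ada0p3.nop ada1p3 ada2p3 ada3p3 ada4p3
zpool export data
gnop destroy ada0p3.nop
zpool import data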

Verifying if a pool is pseudo-4k-ready

To verify that the pool will use 4k sectors, you can have a look at the ashift values of the pool (the ashift is per vdev, so if you e.g. concatenate several mirrors, the ashift needs to be verified for each mirror, and if you concatenate just a bunch of disks, the ashift needs to be verified for all disks). It needs to be 12. To get the ashift value you can use zdb:

zdb rpool | grep ashift
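
On a 4k-ready pool this prints a line like

ashift: 12

for each vdev; a pool created for plain 512-byte sectors shows ashift: 9 instead (the ashift is the base-2 logarithm of the sector size: 2^9 = 512, 2^12 = 4096).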

Setting up the root pool

One of the benefits of root-on-zfs is that I can have multiple FreeBSD boot environments (BE). This means that I can not only have several different kernels, but also several different userland versions. To handle them comfortably, I use manageBE from Philipp Wuensche. This requires a specific setup of the root pool:

zfs create rpool/ROOT
zfs create rpool/ROOT/r220832M
zpool set bootfs=rpool/ROOT/r220832M rpool
zfs set freebsd:boot-environment=1 rpool/ROOT/r220832M   # manageBE setting

The r220832M is my initial BE; as its name I use the SVN revision of the source tree which was used during the install of this BE. You also need to add the following line to /boot/loader.conf:

vfs.root.mountfrom="zfs:rpool/ROOT/r220832M"

As I want to have a shared /var and /tmp for all my BEs, I create them separately:

zfs create -o exec=off -o setuid=off -o mountpoint=/rpool/ROOT/r220832M/var rpool/var
zfs create -o setuid=off -o mountpoint=/rpool/ROOT/r220832M/tmp rpool/tmp

As I did this on the old system, I did not set the mountpoints to /var and /tmp yet; this has to be done later (see below).
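
Changing them later (as described further down, just before exporting the pool) boils down to:

zfs set mountpoint=/var rpool/var
zfs set mountpoint=/tmp rpool/tmp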

Now the userland can be installed (e.g. buildworld/installworld/buildkernel/installkernel/mergemaster with DESTDIR=/rpool/ROOT/r220832M/; do not forget to put a good master.passwd/passwd/group in the root pool).
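
A sketch of this sequence, assuming /usr/src contains the wanted source revision and the GENERIC kernel configuration:

cd /usr/src
make buildworld
make buildkernel KERNCONF=GENERIC
make installworld DESTDIR=/rpool/ROOT/r220832M
make installkernel KERNCONF=GENERIC DESTDIR=/rpool/ROOT/r220832M
mergemaster -i -D /rpool/ROOT/r220832M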

When the root pool is ready, make sure an empty /etc/fstab is inside, and configure the root as follows (only showing what is necessary for root-on-zfs). In the BE's /boot/loader.conf (additionally to the vfs.root.mountfrom line from above):

zfs_load="YES"

And in the BE's /etc/rc.conf:

zfs_enable="YES"

At this point of the setup I unmounted all ZFS filesystems on rpool, set the mountpoint of rpool/var to /var and of rpool/tmp to /tmp, exported the pool and installed the harddisk in the new system. After booting a live-USB image, importing the pool, and putting the resulting zpool.cache into the pool (rpool/ROOT/r220832M/boot/zfs/), I rebooted into the rpool and attached the other harddisks to the pool (“zpool attach rpool ada0p2 ada1p2”, “zpool attach rpool ada0p2 ada2p2”).
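
Each attach triggers a resilver of the new mirror member; its progress (and the final 3-way mirror layout) can be checked with:

zpool status rpool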

After updating to a more recent version of 9-current, the BE looks like this now:

# ./bin/manageBE list
Poolname: rpool
BE                Active Active Mountpoint           Space
Name              Now    Reboot -                    Used
----              ------ ------ ----------           -----
r221295M          yes    yes    /                    2.66G
cannot open '-': dataset does not exist
r221295M@r221295M no     no     -
r220832M          no     no     /rpool/ROOT/r220832M 561M

Used by BE snapshots: 561M

The little bug above (the error message, which is probably caused by the snapshot that shows up here because I use listsnapshots=on) has already been reported to the author of manageBE.