Another root-on-zfs HOWTO (optimized for 4k-sector drives)

Updated: June 8, 2011

Extended the ashift info a little bit to make it more clear in the generic case instead of narrowing it down to just the use-case presented here.

After 9 years with my current home server (one jail for each service like MySQL, Squid, IMAP, Webmail, ...) I decided that it is time to get something more recent (especially as I want to install some more jails but cannot add more memory to this i386 system).

With my old system I had a UFS2 root on a 3-way gmirror, swap on a 2-way gmirror and my data in a 3-partition raidz (all in different slices of the same 3 harddisks; on the third disk the slice which would correspond to the swap was used as a crash dump area).

For the new system I wanted to go all-ZFS, but I like to have my boot area separated from my data area (two pools instead of one big pool). As the machine has 12 GB RAM I also do not configure swap areas (at least by default; if I really need some swap I can add some later, see below). The system has five 1 TB harddisks and a 60 GB SSD. The harddisks do not have 4k sectors, but I expect that there will be more and more 4k-sector drives in the future. As I prefer to plan ahead, I installed the ZFS pools in a way that they are "4k-ready". For those who have 4k-sector drives which do not tell the truth but announce 512-byte sectors (I will call them pseudo-4k-sector drives here) I include a description of how to properly align the (GPT) partitions.

A major requirement for booting from 4k-sector ZFS pools is ZFS v28 (to be correct here, just the boot code needs to support this, so if you take the pmbr and gptzfsboot from a ZFS v28 system, this should work... but I have not tested this). As I am running 9-current, this is not an issue for me.

In short, the task is to align the partitions/slices properly for pseudo-4k-sector drives and then use gnop temporarily at pool creation time to have ZFS use 4k sectors during the lifetime of the pool. The long description follows.

The layout of the drives

The five equal drives are partitioned with a GUID partition table (GPT). Each drive is divided into three partitions: one for the boot code, one for the root pool, and one for the data pool. The root pool is a 3-way mirror and the data pool is a raidz2 pool over all 5 disks. The two harddisks which do not take part in the mirroring of the root pool get swap partitions of the same size as the root partitions instead. One of them is used as a dump device (this is -current, after all), and the other one stays unused as a cold standby. The 60 GB SSD will be used as a ZFS cache device, but as of this writing I have not decided yet if I will use it for both pools or only for the data pool.

Calculating the offsets

The first sector after the GPT (created with standard settings) which can be used as the first sector of a partition is sector 34 on a 512-bytes-per-sector drive. On a pseudo-4k-sector drive this would be somewhere inside sector 4 of a real 4k-sector drive, so this is not a good starting point. The next 4k-aligned sector on a pseudo-4k-sector drive is sector 40 (sector 5 on a real 4k-sector drive).
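In other words, a start sector on a pseudo-4k-sector drive is 4k-aligned when its number is divisible by 8 (8 * 512 = 4096):

34 * 512 = 17408 = 4.25 * 4096   (34 % 8 != 0, not aligned)
40 * 512 = 20480 = 5    * 4096   (40 % 8 = 0, aligned)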

The first partition is the partition for the FreeBSD boot code. It needs to have enough space for gptzfsboot. Only allocating the space needed for gptzfsboot looks a little bit dangerous regarding future updates, so my harddisks are configured to allocate half a megabyte for it. Additionally I leave some unused sectors as a safety margin after this first partition.

The second partition is the root pool (respectively the swap partition on the other two disks). I let it start at sector 2048, which would be sector 256 on a real 4k-sector drive (if you do not want to waste the less than half a megabyte in between, just calculate a lower start sector which is divisible by 8 (start % 8 = 0)). It is a 4 GB partition, which is enough for the base system with some debug kernels. Everything else (/usr/{src,ports,obj,local}) will be in the data partition.
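As an example for such a lower start sector: the first 4k-aligned sector after the boot partition would be 1064 (1064 % 8 = 0); I use 2048 (2048 * 512 bytes = 1 MiB) to keep the safety margin mentioned above.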

The last partition starts directly after the second and uses the rest of the harddisk, rounded down to a full GB (if the disk needs to be replaced with a similar-sized disk there is some safety margin left, as the number of sectors in harddisks fluctuates a little bit even between the same models from the same manufacturing batch). For my harddisks this means a little bit more than half a gigabyte of wasted storage space.
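For my disks this works out as follows (the sector numbers can be found in the gpart output further down):

1953525135 - 8390656 = 1945134479 free sectors after the root partition (about 927.5 GB)
927 GB = 927 * 1024 * 1024 * 1024 / 512 = 1944059904 sectors
1945134479 - 1944059904 = 1074575 unused sectors (about 524 MB)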

The commands to partition the disks

In the following I use ada0 as the device of the disk, but it also works with daX or adX or similar. I installed one disk from an existing 9-current system instead of using some kind of installation media (beware: the pool is linked to the system which creates it; I booted a live USB image on the new system to import the pool there and copied the resulting zpool.cache to /boot/zfs/ after importing).

Create the GPT:

gpart create -s gpt ada0

Create the boot partition:

gpart add -b 40 -s 1024 -t freebsd-boot ada0

Create the root/swap partitions and name them with a GPT label:

gpart add -b 2048 -s 4G -t freebsd-zfs -l rpool0 ada0

or for the swap

gpart add -b 2048 -s 4G -t freebsd-swap -l swap0 ada0
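The swap-sized partitions come into play later. A minimal sketch of how one of them can be used as the dump device (and, if swap is ever really needed, as swap), assuming the GPT label swap0 from above:

rc.conf:
---snip---
dumpdev="/dev/gpt/swap0"
---snip---

/etc/fstab:
---snip---
/dev/gpt/swap0	none	swap	sw	0	0
---snip---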

Create the data partition and name it with a GPT label:

gpart add -s 927G -t freebsd-zfs -l data0 ada0

Install the boot code in partition 1:

gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0

The result looks like this:

# gpart show ada0
=>        34  1953525101  ada0  GPT  (931G)
          34           6        - free -  (3.0k)
          40        1024     1  freebsd-boot  (512k)
        1064         984        - free -  (492k)
        2048     8388608     2  freebsd-zfs  (4.0G)
     8390656  1944059904     3  freebsd-zfs  (927G)
  1952450560     1074575        - free -  (524M)

Create the pools with 4k-ready internal structures

Creating a ZFS pool on one of the ZFS partitions without preparation will not create a 4k-ready pool on a pseudo-4k-sector drive. I used gnop (the settings do not survive a reboot) to make the partition temporarily a 4k-sector provider (only the commands for the root pool are shown; for the data partition gnop has to be used in the same way, see the sketch below):

gnop create -S 4096 ada0p2
zpool create -O utf8only=on -o failmode=panic rpool ada0p2.nop
zpool export rpool
gnop destroy ada0p2.nop
zpool import rpool

After the pool is created, it will keep the 4k-sector setting, even when accessed without gnop. You can ignore the options I used to create the pool; they are just my preferences (and the utf8only setting can only be done at pool creation time). If you prepare this on a system which already has a zpool of its own, you can maybe specify "-o cachefile=/boot/zfs/zpool2.cache" and copy it to the new pool as zpool.cache to make it bootable without the need for a live image on the new system (I did not test this).
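For the data pool the procedure is the same. A sketch, assuming the pool name "data" and the five data partitions ada0p3 through ada4p3 (one gnop provider per vdev is enough, as ZFS uses the largest sector size among the vdev members):

gnop create -S 4096 ada0p3
zpool create data raidz2 ada0p3.nop ada1p3 ada2p3 ada3p3 ada4p3
zpool export data
gnop destroy ada0p3.nop
zpool import data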

Verifying if a pool is pseudo-4k-ready

To verify that the pool will use 4k sectors, you can have a look at the ashift values of the pool (the ashift is per vdev, so if you e.g. concatenate several mirrors, the ashift needs to be verified for each mirror, and if you concatenate just a bunch of disks, the ashift needs to be verified for all disks). It needs to be 12. To get the ashift value you can use zdb:

zdb rpool | grep ashift
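For a 4k-ready pool this should print 12 (once per vdev):

ashift: 12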

Setting up the root pool

One of the benefits of root-on-zfs is that I can have multiple FreeBSD boot environments (BE). This means that I not only can have several different kernels, but also several different userland versions. To handle them comfortably, I use manageBE from Philipp Wuensche. This requires a specific setup of the root pool:

zfs create rpool/ROOT
zfs create rpool/ROOT/r220832M
zpool set bootfs=rpool/ROOT/r220832M rpool
zfs set freebsd:boot-environment=1 rpool/ROOT/r220832M   # manageBE setting

The r220832M is my initial BE. I use the SVN revision of the source tree which was used during install of this BE as the name of the BE here. You also need to add the following line to /boot/loader.conf:

vfs.root.mountfrom="zfs:rpool/ROOT/r220832M"

As I want to have a shared /var and /tmp for all my BEs, I create them separately:

zfs create -o exec=off -o setuid=off -o mountpoint=/rpool/ROOT/r220832M/var rpool/var
zfs create -o setuid=off -o mountpoint=/rpool/ROOT/r220832M/tmp rpool/tmp

As I did this on the old system, I did not set the mountpoints to /var and /tmp, but this has to be done later.
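The corresponding commands (to be run later, before exporting the pool, as described below):

zfs set mountpoint=/var rpool/var
zfs set mountpoint=/tmp rpool/tmp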

Now the userland can be installed (e.g. buildworld/installworld/buildkernel/installkernel/mergemaster with DESTDIR=/rpool/ROOT/r220832M/; do not forget to put a good master.passwd/passwd/group in the root pool).
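A sketch of this step, assuming the source tree is in /usr/src and the BE dataset is mounted at /rpool/ROOT/r220832M ("make distribution" populates /etc on a fresh destination; the exact sequence may vary):

cd /usr/src
make buildworld buildkernel
make installworld DESTDIR=/rpool/ROOT/r220832M/
make distribution DESTDIR=/rpool/ROOT/r220832M/
make installkernel DESTDIR=/rpool/ROOT/r220832M/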

When the root pool is ready make sure an empty /etc/fstab is inside, and configure the root as follows (only showing what is necessary for root-on-zfs):

loader.conf:
---snip---
vfs.root.mountfrom="zfs:rpool/ROOT/r220832M"
zfs_load="yes"
opensolaris_load="yes"
---snip---

rc.conf:
---snip---
zfs_enable="YES"
---snip---

At this point of the setup I unmounted all ZFS filesystems on rpool, set the mountpoint of rpool/var to /var and of rpool/tmp to /tmp, exported the pool and installed the harddisk in the new system. After booting a live USB image, importing the pool and putting the resulting zpool.cache into the pool (rpool/ROOT/r220832M/boot/zfs/), I rebooted into the rpool and attached the other harddisks to the pool ("zpool attach rpool ada0p2 ada1p2", "zpool attach rpool ada0p2 ada2p2").
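A sketch of the import and cache-copy step on the live system (the paths are assumptions; if the live system's /boot is read-only, the cache file location can be overridden with something like "zpool import -o cachefile=/tmp/zpool.cache rpool" and the copy adjusted accordingly):

zpool import rpool
cp /boot/zfs/zpool.cache /rpool/ROOT/r220832M/boot/zfs/zpool.cache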

After updating to a more recent version of 9-current, the list of BEs now looks like this:

# ./bin/manageBE list
Poolname: rpool
BE                Active Active Mountpoint           Space
Name              Now    Reboot -                    Used
----              ------ ------ ----------           -----
r221295M          yes    yes    /                    2.66G
cannot open '-': dataset does not exist
r221295M@r221295M no     no     -
r220832M          no     no     /rpool/ROOT/r220832M  561M

Used by BE snapshots: 561M

The little bug above (the error message, probably caused by the snapshot, which shows up here because I use listsnapshots=on) has already been reported to the author of manageBE.

13 thoughts on “Another root-on-zfs HOWTO (optimized for 4k-sector drives)”

  1. Hi, I think you have an error in your logic. You mention that you check that ashift for the pool is set to 12, but ashift is not a pool-wide setting. That is, you can create a pool with a single vdev initially and set ashift 12, but you can then add additional vdevs later and they will only have ashift 12 if they are detected to have 4k sectors (i.e. the default would be ashift 9).

    cheers Andy.

    1. I updated the ashift info in the article to not only cover the use-case as presented here, to make it more useful for the generic case.

  2. Hi, I just installed a ZFS root mirror with CURRENT on 4k-sector drives before I saw your article. Before I tear down my setup: is the performance much improved over 512-byte sector sizes? Thanks for your posting. Great info.
    Output from diskinfo for /dev/ada0:
    512             # sectorsize
    2000398934016   # mediasize in bytes (1.8T)
    3907029168      # mediasize in sectors
    4096            # stripesize
    0               # stripeoffset
    3876021         # Cylinders according to firmware.
    16              # Heads according to firmware.
    63              # Sectors according to firmware.
    5YD2LLPS        # Disk ident.

    1. First check if your partitions are already aligned (manual calculation), and if the pool is using 4k sectors (see the zdb command in the article). From the output of diskinfo I have the impression you run a recent -current which already should do the right thing at least in ZFS (for gpart you may need to specify the new -a option to align a partition correctly).

      And yes, if you really have a 4k-sector drive, there is a big speed difference between aligned and unaligned (and 4k sectors in ZFS or not).

  3. Thanks for the info.
    I am using CURRENT as of June 12th.

    Unfortunately, I get 9 when I run zdb.
    zdb zroot | grep ashift
    ashift: 9
    ashift: 9
    I was wondering: can I use gpart to resize the disk with the -a option to correct any problems without reinstalling? Like get into single-user mode, detach the filesystem, gpart resize -a, and then mount the filesystem again. Would this work?

    1. Aligning a partition means moving the data inside to a different place. I am not aware that gpart is able to move the data of a partition.

  4. It just reboots… To my understanding, pmbr should somehow call gptzfsboot, which should find zpool.cache, which contains the result of “zpool set bootfs=/rpool rpool”, mount rpool and then start /boot/kernel?! But it immediately reboots without a message, so what does not work? O dear god I'm such a noob… Enlighten me with your insight, master.

    1. Your description does not contain enough info to be able to help. Can you please describe on fs@FreeBSD.org what you did and in which order? There are more people there (with more time than I have) who should be able to help.

  5. I see you set the size of the boot partition to 512 kB (1024 blocks). You should be aware that the boot code actually loads the entire partition, so you want to keep it as small as possible. There's not much point in aligning it, either, since it's only read once, at boot time, and never written to after installation.
    If you start your boot partition at offset 34 (the first available block on a GPT disk) and give it 94 blocks (47 kB), the next partition will start at offset 128, which is a nice round number. If you absolutely must align the boot partition, you can place it at offset 36 with a length of 92 blocks (46 kB). The GPT ZFS second-stage loader, gptzfsboot, is only about 30 kB, so 92 blocks is plenty, even allowing for a reasonable amount of future bloat.

  6. HEAD gptzfsboot was apparently broken. I installed the one from a 9-CURRENT USB image, and now everything works as expected. Great performance. Thank you so much for your competent tutorial.

  7. Hi Alex,
    I'm in the process of rebuilding my NAS and have upgraded the disks to new WD Caviar Green with Adv. Format. One thing which I haven't been able to find anywhere is the assurance that when using gpart -b to specify the starting sector, the util actually does the (sector)-1 for drives where the LBA starts at zero and not 1 (as is the case for most drives).
    I'm assuming from your walkthrough that this is the case and that I can stop worrying about the 4k sector alignment by specifying the starting sector in absolute terms and not the LBA itself?
    Thanks.

  8. Gavin, I did not take into account the difference between drives which start at 0 resp. 1. I suggest using des' tool (can't remember the name of it) which tests the alignment of those drives, to be sure.
