Alexander Leidinger

Just another weblog

May 3, 2011

Another root-on-zfs HOWTO (optimized for 4k-sector drives)

Updated: June 8, 2011

Extended the ashift info a little bit to make it clearer for the generic case instead of narrowing it down to just the use-case presented here.

After 9 years with my current home-server (one jail for each service like MySQL, Squid, IMAP, Webmail, …) I decided that it is time to get something more recent (especially as I want to install some more jails but cannot add more memory to this i386 system).

With my old system I had a UFS2 root on a 3-way-gmirror, swap on a 2-way-gmirror and my data in a 3-partition raidz (all in different slices of the same 3 harddisks; on the third disk the slice which would correspond to the swap was used as a crashdump area).

For the new system I wanted to go all-ZFS, but I like to have my boot area separated from my data area (two pools instead of one big pool). As the machine has 12 GB RAM I also do not configure swap areas (at least by default; if I really need some swap I can add some later, see below). The system has five 1 TB harddisks and a 60 GB SSD. The harddisks do not have 4k-sectors, but I expect that there will be more and more 4k-sector drives in the future. As I prefer to plan ahead I installed the ZFS pools in a way that they are "4k-ready". For those who have 4k-sector drives which do not tell the truth but announce 512 byte sectors (I will call them pseudo-4k-sector drives here) I include a description of how to properly align the (GPT-)partitions.

A major requirement to boot 4k-sector-size ZFS pools is ZFS v28 (to be precise, just the boot-code needs to support this, so if you take the pmbr and gptzfsboot from a ZFS v28 system, this should work… but I have not tested this). As I am running 9-current, this is not an issue for me.

A quick description of the task: align the partitions/slices properly for pseudo-4k-sector drives, and then use gnop temporarily during pool creation to make ZFS use 4k-sectors during the lifetime of the pool. The long description follows.

The layout of the drives

The five equal drives are partitioned with a GUID partition table (GPT). Each drive is divided into three partitions: one for the boot code, one for the root pool, and one for the data pool. The root pool is a 3-way mirror and the data pool is a raidz2 pool over all 5 disks. On the two harddisks which do not take part in the mirroring of the root pool, the remaining space gets swap partitions of the same size as the root partitions. One of them is used as a dump device (this is -current, after all), and the other one stays unused as a cold-standby. The 60 GB SSD will be used as a ZFS cache device, but as of this writing I have not decided yet if I will use it for both pools or only for the data pool.

Calculating the offsets

The first sector after the GPT (created with standard settings) which can be used as the first sector for a partition is sector 34 on a 512 bytes-per-sector drive. On a pseudo-4k-sector drive this would land somewhere inside real 4k-sector 4, not at its start, so this is not a good starting point. The next 4k-aligned sector on a pseudo-4k-sector drive is sector 40 (sector 5 on a real 4k-sector drive).
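
A quick check of the arithmetic (there are 8 pseudo-512-byte sectors per real 4k sector, so a start sector is aligned exactly when it is divisible by 8):

34 % 8 = 2   (34 * 512 = 17408 bytes, not a multiple of 4096 -> misaligned)
40 % 8 = 0   (40 * 512 = 20480 bytes = 5 * 4096 -> aligned)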

The first partition is the partition for the FreeBSD boot code. It needs to have enough space for gptzfsboot. Allocating only the space needed for gptzfsboot looks a little bit dangerous regarding future updates, so my harddisks are configured to allocate half a megabyte for it. Additionally I leave some unused sectors as a safety margin after this first partition.

The second partition is the root pool (respectively the swap partitions). I let it start at sector 2048, which would be sector 256 on a real 4k-sector drive (if you do not want to waste the space, which is less than half a megabyte, just calculate a lower start sector which is divisible by 8, i.e. start % 8 = 0). It is a 4 GB partition, which is enough for the base system with some debug kernels. Everything else (/usr/{src,ports,obj,local}) will be in the data partition.
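
The same divisibility check confirms this start sector:

2048 % 8 = 0   (2048 * 512 = 1048576 bytes = exactly 1 MiB -> aligned)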

The last partition starts directly after the second and uses the rest of the harddisk rounded down to a full GB (if the disk needs to be replaced with a similar-sized disk there is some safety margin left, as the number of sectors in harddisks fluctuates a little bit even in the same models from the same manufacturing batch). For my harddisks this means a little bit more than half a gigabyte of wasted storage space.
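
As a worked example with the numbers from the gpart output shown below: with 512-byte sectors one GB (2^30 bytes) corresponds to 2097152 sectors, so the 927 GB data partition occupies 927 * 2097152 = 1944059904 sectors, leaving the last 1074575 sectors (about 524 MB) of the disk unused.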

The commands to partition the disks

In the following I use ada0 as the device of the disk, but it also works with daX or adX or similar. I installed one disk from an existing 9-current system instead of using some kind of installation media (beware, the pool is linked to the system which creates it; I booted a live-USB image to import it on the new system and copied the zpool.cache to /boot/zfs/ after importing on the new system).

Create the GPT:

gpart create -s gpt ada0

Create the boot partition:

gpart add -b 40 -s 1024 -t freebsd-boot ada0

Create the root/swap partitions and name them with a GPT label:

gpart add -b 2048 -s 4G -t freebsd-zfs -l rpool0 ada0

or for the swap

gpart add -b 2048 -s 4G -t freebsd-swap -l swap0 ada0
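
If swap is really needed later it can be activated at runtime on such a partition; a minimal sketch, assuming the GPT label appears under /dev/gpt as usual:

swapon /dev/gpt/swap0   # use the labeled partition as swap
dumpon /dev/gpt/swap0   # or configure it as the crash-dump device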

Create the data partition and name it with a GPT label:

gpart add -s 927G -t freebsd-zfs -l data0 ada0

Install the boot code in partition 1:

gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0

The result looks like this:

# gpart show ada0
=>        34  1953525101  ada0  GPT  (931G)
          34           6        - free -  (3.0k)
          40        1024     1  freebsd-boot  (512k)
        1064         984        - free -  (492k)
        2048     8388608     2  freebsd-zfs  (4.0G)
     8390656  1944059904     3  freebsd-zfs  (927G)
  1952450560     1074575        - free -  (524M)

Create the pools with 4k-ready internal structures

Creating a ZFS pool on one of the ZFS partitions without preparation will not create a 4k-ready pool on a pseudo-4k-drive. I used gnop (the settings do not survive a reboot) to make the partition temporarily a 4k-sector partition (only the commands for the root pool are shown; for the data partition gnop has to be used in the same way, see the sketch below):

gnop create -S 4096 ada0p2
zpool create -O utf8only=on -o failmode=panic rpool ada0p2.nop
zpool export rpool
gnop destroy ada0p2.nop
zpool import rpool

After the pool is created, it will keep the 4k-sectors setting, even when accessed without gnop. You can ignore the options I used to create the pool, they are just my preferences (and the utf8only setting can only be done at pool creation time). If you prepare this on a system which already has a zpool of its own, you can maybe specify "-o cachefile=/boot/zfs/zpool2.cache" and copy it to the new pool as zpool.cache to make it bootable without the need of a live image for the new system (I did not test this).
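
For the data pool the same gnop trick is used with the p3 partitions of all five disks; a minimal sketch, with "data" as a hypothetical pool name (add your preferred -o/-O options as above):

gnop create -S 4096 ada0p3 ada1p3 ada2p3 ada3p3 ada4p3
zpool create data raidz2 ada0p3.nop ada1p3.nop ada2p3.nop ada3p3.nop ada4p3.nop
zpool export data
gnop destroy ada0p3.nop ada1p3.nop ada2p3.nop ada3p3.nop ada4p3.nop
zpool import data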

Verifying if a pool is pseudo-4k-ready

To verify that the pool will use 4k-sectors, you can have a look at the ashift values of the pool (the ashift is per vdev, so if you e.g. concatenate several mirrors, the ashift needs to be verified for each mirror, and if you concatenate just a bunch of disks, the ashift needs to be verified for all disks). It needs to be 12. To get the ashift value you can use zdb:

zdb rpool | grep ashift
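
For a 4k-ready pool this prints, once per vdev, a line like:

ashift: 12

A value of 9 (2^9 = 512) means the vdev was created with 512-byte sectors.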

Setting up the root pool

One of the benefits of root-on-zfs is that I can have multiple FreeBSD boot environments (BE). This means that I can have not only several different kernels, but also several different userland versions. To handle them comfortably, I use manageBE from Philipp Wuensche. This requires a specific setup of the root pool:

zfs create rpool/ROOT
zfs create rpool/ROOT/r220832M
zpool set bootfs=rpool/ROOT/r220832M rpool
zfs set freebsd:boot-environment=1 rpool/ROOT/r220832M   # manageBE setting

The r220832M is my initial BE. I use the SVN revision of the source tree which was used during the install of this BE as the name of the BE. You also need to add the following line to /boot/loader.conf:

vfs.root.mountfrom="zfs:rpool/ROOT/r220832M"

As I want to have a shared /var and /tmp for all my BEs, I create them separately:

zfs create -o exec=off -o setuid=off -o mountpoint=/rpool/ROOT/r220832M/var rpool/var
zfs create -o setuid=off -o mountpoint=/rpool/ROOT/r220832M/tmp rpool/tmp

As I did this on the old system, I did not set the mountpoints to /var and /tmp; this has to be done later (see below).
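
Switching the mountpoints over later (on the new system, before booting from the pool) then looks like this:

zfs set mountpoint=/var rpool/var
zfs set mountpoint=/tmp rpool/tmp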

Now the userland can be installed (e.g. buildworld/buildkernel/installkernel/installworld/mergemaster with DESTDIR=/rpool/ROOT/r220832M/; do not forget to put a good master.passwd/passwd/group in the root pool).
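
A minimal sketch of that sequence, assuming the BE dataset is mounted at /rpool/ROOT/r220832M and the sources live in /usr/src:

cd /usr/src
make buildworld
make buildkernel
make installkernel DESTDIR=/rpool/ROOT/r220832M
make installworld DESTDIR=/rpool/ROOT/r220832M
mergemaster -D /rpool/ROOT/r220832M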

When the root pool is ready make sure an empty /etc/fstab is inside it, and configure the root as follows (only showing what is necessary for root-on-zfs):

loader.conf:
---snip---
vfs.root.mountfrom="zfs:rpool/ROOT/r220832M"
zfs_load="yes"
opensolaris_load="yes"
---snip---

rc.conf:
---snip---
zfs_enable="YES"
---snip---

At this point of the setup I unmounted all ZFS filesystems on rpool, set the mountpoint of rpool/var to /var and of rpool/tmp to /tmp, exported the pool, and installed the harddisk in the new system. After booting a live-USB image, importing the pool and putting the resulting zpool.cache into the pool (rpool/ROOT/r220832M/boot/zfs/), I rebooted into the rpool and attached the other harddisks to the pool ("zpool attach rpool ada0p2 ada1p2", "zpool attach rpool ada0p2 ada2p2").
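
The cache-file step from the live system looks roughly like this (a sketch; setting the cachefile property at import avoids touching the live image's own zpool.cache, and the target path assumes the BE dataset is mounted as above):

zpool import -o cachefile=/tmp/zpool.cache rpool
cp /tmp/zpool.cache /rpool/ROOT/r220832M/boot/zfs/zpool.cache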

After updating to a more recent version of 9-current, the BE list now looks like this:

# ./bin/manageBE list
Poolname: rpool
BE                Active Active Mountpoint           Space
Name              Now    Reboot -                    Used
----              ------ ------ ----------           -----
r221295M          yes    yes    /                    2.66G
cannot open '-': dataset does not exist
r221295M@r221295M no     no     -
r220832M          no     no     /rpool/ROOT/r220832M  561M

Used by BE snapshots: 561M

The little bug above (the error message, which is probably caused by the snapshot that shows up here because I use listsnapshots=on) has already been reported to the author of manageBE.



13 Responses to "Another root-on-zfs HOWTO (optimized for 4k-sector drives)"

  1. Andy Says:

    Hi, I think you have an error in your logic. You mention that you check that ashift for the pool is set to 12, but ashift is not a pool-wide setting. That is, you can create a pool with a single vdev initially and set ashift 12, but you can then add additional vdevs later and they will only have ashift 12 if they are detected to have 4k sectors (i.e. the default would be ashift 9).

    cheers Andy.

  2. netchild Says:

    I updated the ashift info in the article so that it does not only cover the use-case presented here, which makes it more useful for the generic case.

  3. Derrick Says:

    Hi, I just installed a ZFS root mirror with CURRENT on 4k-sector drives before I saw your article. Before I tear down my setup: is the performance much improved over 512-byte sector sizes? Thanks for your posting. Great info.
    Output from diskinfo for /dev/ada0:
    512           # sectorsize
    2000398934016 # mediasize in bytes (1.8T)
    3907029168    # mediasize in sectors
    4096          # stripesize
    0             # stripeoffset
    3876021       # Cylinders according to firmware.
    16            # Heads according to firmware.
    63            # Sectors according to firmware.
    5YD2LLPS      # Disk ident.

  4. netchild Says:

    First check if your partitions are already aligned (manual calculation), and if the pool is using 4k sectors (see the zdb command in the article). From the output of diskinfo I have the impression you run a recent -current, which already should do the right thing at least in ZFS (for gpart you may need to specify the new -a option to align a partition correctly).

    And yes, if you really have a 4k-sector drive, there is a big speed difference between aligned and unaligned (and between 4k-sectors in ZFS or not).

  5. Derrick Says:

    Thanks for the info.
    I am using CURRENT as of June 12th.

    Unfortunately, I get 9 when I run zdb:
    zdb zroot | grep ashift
    ashift: 9
    ashift: 9
    I was wondering: can I use gpart to resize the disk with the -a option to correct any problems without reinstalling? Like get into single-user mode, detach the filesystem, gpart resize -a, and then mount the filesystem again. Would this work?

  6. netchild Says:

    Aligning a partition means moving the data inside it to a different place. I am not aware that gpart is able to move the data of a partition.

  7. johnny Says:

    it just reboots… To my understanding, pmbr should somehow call gptzfsboot, which should find zpool.cache, which contains the result of "zpool set bootfs=/rpool rpool", mount rpool and then start /boot/kernel?! But it immediately reboots without a message, so what does not work? Oh dear god, I'm such a noob… Enlighten me with your insight, master.

  8. netchild Says:

    Your description does not contain enough info to be able to help. Can you please describe on fs@FreeBSD.org what you did and in which order? There are more people there (with more time than I have) who should be able to help.

  9. DES Says:

    I see you set the size of the boot partition to 512 kB (1024 blocks). You should be aware that the boot code actually loads the entire partition, so you want to keep it as small as possible. There's not much point in aligning it, either, since it's only read once, at boot time, and never written to after installation.
    If you start your boot partition at offset 34 (the first available block on a GPT disk) and give it 94 blocks (47 kB), the next partition will start at offset 128, which is a nice round number. If you absolutely must align the boot partition, you can place it at offset 36 with a length of 92 blocks (46 kB). The GPT ZFS second-stage loader, gptzfsboot, is only about 30 kB, so 92 blocks is plenty, even allowing for a reasonable amount of future bloat.

  10. johnny Says:

    HEAD gptzfsboot was apparently broken. I installed the one from a 9-CURRENT USB image, and now everything works as expected. Great performance. Thank you so much for your competent tutorial.

  11. Gavin Says:

    Hi Alex,
    I'm in the process of rebuilding my NAS and have upgraded the disks to new WD Caviar Green drives with Advanced Format. One thing which I haven't been able to find anywhere is the assurance that, when using gpart -b to specify the starting sector, the util actually does the (sector)-1 for drives where the LBA starts at zero and not 1 (as is the case for most drives).
    I'm assuming from your walkthrough that this is the case and that I can stop worrying about the 4k-sector alignment by specifying the starting sector in absolute terms and not the LBA itself?
    Thanks.

  12. netchild Says:

    Gavin, I did not take into account the difference between drives which start at 0 respectively at 1. I suggest using DES's tool (I can't remember its name) which tests the alignment of those drives, to be sure.

  13. zfsroot guides – which one to use? » Dan Langille's Other Diary Says:

    […] OPTIMIZED FOR 4K-SECTOR DRIVES – interesting, but not set up the way I want my system […]
