Another root-on-zfs HOWTO (optimized for 4k-sector drives)

Updated: June 8, 2011

Extended the ashift info a little to cover the generic case instead of narrowing it down to just the use-case presented here.

After 9 years with my current home server (one jail for each service like MySQL, Squid, IMAP, webmail, …) I decided that it is time to get something more recent (especially as I want to install some more jails but cannot add more memory to this i386 system).

With my old system I had a UFS2 root on a 3-way gmirror, swap on a 2-way gmirror, and my data in a 3-partition raidz (all in different slices of the same 3 harddisks; the 3rd slice, which would correspond to the swap, was used as a crashdump area).

For the new system I wanted to go all-ZFS, but I like to have my boot area separated from my data area (two pools instead of one big pool). As the machine has 12 GB RAM I also do not configure swap areas (at least by default; if I really need some swap I can add some later, see below). The system has five 1 TB harddisks and a 60 GB SSD. The harddisks do not have 4k sectors, but I expect that there will be more and more 4k-sector drives in the future. As I prefer to plan ahead, I installed the ZFS pools in a way that makes them "4k-ready". For those who have 4k-sector drives which do not tell the truth but announce 512-byte sectors (I will call them pseudo-4k-sector drives here) I include a description of how to properly align the (GPT) partitions.

A major requirement for booting 4k-sector-size ZFS pools is ZFS v28 (to be precise, just the boot code needs to support this, so if you take the pmbr and gptzfsboot from a ZFS v28 system, this should work… but I have not tested it). As I am running 9-current, this is not an issue for me.

A quick description of the task: align the partitions/slices properly for pseudo-4k-sector drives, then use gnop temporarily at pool creation time so that ZFS uses 4k sectors for the lifetime of the pool. The long description follows.

The layout of the drives

The five equal drives are partitioned with a GUID partition table (GPT). Each drive is divided into three partitions: one for the boot code, one for the root pool, and one for the data pool. The root pool is a 3-way mirror and the data pool is a raidz2 pool over all 5 disks. The remaining space on the two harddisks which do not take part in the mirroring of the root pool gets swap partitions of the same size as the root partitions. One of them is used as a dump device (this is -current, after all), and the other one stays unused as a cold standby. The 60 GB SSD will be used as a ZFS cache device, but as of this writing I have not decided yet if I will use it for both pools or only for the data pool.

Calculating the offsets

The first sector after the GPT (created with standard settings) which can be used as the first sector of a partition is sector 34 on a 512-bytes-per-sector drive. On a pseudo-4k-sector drive this would fall somewhere inside real 4k-sector number 4, so it is not a good starting point. The next 4k-aligned sector on a pseudo-4k-sector drive is sector 40 (sector 5 on a real 4k-sector drive).
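The arithmetic can be sketched in plain sh: a 512-byte LBA is 4k-aligned exactly when it is divisible by 8 (4096 / 512):

```shell
# A 512B-sector LBA is 4k-aligned exactly when LBA % 8 == 0.
first_usable=34                          # first LBA after a standard GPT
echo $(( first_usable % 8 ))             # 2, i.e. not aligned
aligned=$(( (first_usable + 7) / 8 * 8 )) # round up to the next multiple of 8
echo "next 4k-aligned LBA: $aligned"     # 40
```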

The first partition is the partition for the FreeBSD boot code. It needs enough space for gptzfsboot. Allocating only the space gptzfsboot currently needs looks a little dangerous regarding future updates, so my harddisks are configured to allocate half a megabyte for it. Additionally I leave some unused sectors as a safety margin after this first partition.

The second partition is the root pool (respectively the swap partitions). I let it start at sector 2048, which would be sector 256 on a real 4k-sector drive (if you do not want to waste the little less than half a megabyte in between, just calculate a lower start sector which is divisible by 8 (start % 8 = 0)). It is a 4 GB partition, which is enough for the base system with some debug kernels. Everything else (/usr/{src,ports,obj,local}) will be in the data partition.

The last partition comes directly after the second and uses the rest of the harddisk, rounded down to a full GB (if the disk needs to be replaced with a similar-sized disk there is some safety margin left, as the number of sectors in harddisks fluctuates a little even within the same model from the same manufacturing batch). For my harddisks this means a little more than half a gigabyte of wasted storage space.
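As a sanity check, the sizing can be reproduced with shell arithmetic (the sector count is from the 1 TB drive shown in the gpart output further down; all numbers are 512-byte sectors, and since a GiB is a multiple of 8 sectors the rounding preserves 4k alignment):

```shell
disk_sectors=1953525168               # total LBAs reported by the drive
last_usable=$(( disk_sectors - 34 ))  # GPT reserves 33 LBAs at the disk end
data_start=$(( 2048 + 4 * 2097152 ))  # first LBA after the 4 GB partition: 8390656
gib=$(( 1024 * 1024 * 1024 / 512 ))   # sectors per GiB (2097152, a multiple of 8)
avail=$(( last_usable - data_start + 1 ))
data_size=$(( avail / gib * gib ))    # round down to a full GiB
echo "$(( data_size / gib ))G data partition"      # 927G
echo "$(( avail - data_size )) sectors left over"  # 1074575 (~524 MB)
```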

The commands to partition the disks

In the following I use ada0 as the device of the disk, but it also works with daX or adX or similar. I installed one disk from an existing 9-current system instead of using some kind of installation media (beware, the pool is linked to the system which creates it; I booted a live USB image to import it on the new system and copied the zpool.cache to /boot/zfs/ after importing on the new system).

Create the GPT:

gpart create -s gpt ada0

Create the boot partition:

gpart add -b 40 -s 1024 -t freebsd-boot ada0

Create the root/swap partitions and name them with a GPT label:

gpart add -b 2048 -s 4G -t freebsd-zfs -l rpool0 ada0

or for the swap:

gpart add -b 2048 -s 4G -t freebsd-swap -l swap0 ada0
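Should swap be needed later (as mentioned in the introduction), the labeled partition can be enabled with an /etc/fstab entry like this (a sketch; FreeBSD exposes GPT labels under /dev/gpt/):

```
/dev/gpt/swap0   none   swap   sw   0   0
```

Running swapon -a afterwards activates it without a reboot.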

Create the data partition and name it with a GPT label:

gpart add -s 927G -t freebsd-zfs -l data0 ada0

Install the boot code in partition 1:

gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0
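With five identical disks the whole sequence can be scripted. A cautious sketch (the disk names ada0..ada4 and the rpool/swap split follow the layout described above; it only prints the commands so they can be reviewed before piping the output to sh on the target machine):

```shell
# Print (not run) the partitioning commands for all five disks:
# the first three get a root-pool slice, the last two the swap slice.
out=$(
for disk in ada0 ada1 ada2 ada3 ada4; do
  i=${disk#ada}
  echo "gpart create -s gpt $disk"
  echo "gpart add -b 40 -s 1024 -t freebsd-boot $disk"
  if [ "$i" -lt 3 ]; then
    echo "gpart add -b 2048 -s 4G -t freebsd-zfs -l rpool$i $disk"
  else
    echo "gpart add -b 2048 -s 4G -t freebsd-swap -l swap$((i - 3)) $disk"
  fi
  echo "gpart add -s 927G -t freebsd-zfs -l data$i $disk"
  echo "gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 $disk"
done
)
printf '%s\n' "$out"
```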

The result looks like this:

# gpart show ada0
=>        34  1953525101  ada0  GPT  (931G)
          34           6        - free -  (3.0k)
          40        1024     1  freebsd-boot  (512k)
        1064         984        - free -  (492k)
        2048     8388608     2  freebsd-zfs  (4.0G)
     8390656  1944059904     3  freebsd-zfs  (927G)
  1952450560     1074575        - free -  (524M)

Create the pools with 4k-ready internal structures

Creating a ZFS pool on one of the ZFS partitions without preparation will not create a 4k-ready pool on a pseudo-4k-drive. I used gnop (the settings do not survive a reboot) to turn the partition temporarily into a 4k-sector provider (only the commands for the root pool are shown; for the data partition gnop has to be used in the same way):

gnop create -S 4096 ada0p2
zpool create -O utf8only=on -o failmode=panic rpool ada0p2.nop
zpool export rpool
gnop destroy ada0p2.nop
zpool import rpool

After the pool is created, it will keep the 4k-sectors setting, even when accessed without gnop. You can ignore the options I used to create the pool, they are just my preferences (and the utf8only setting can only be set at pool creation time). If you prepare this on a system which already has a zpool of its own, you can maybe specify "-o cachefile=/boot/zfs/zpool2.cache" and copy it to the new pool as zpool.cache to make it bootable without the need of a live image for the new system (I did not test this).
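The same trick applied to the data pool would look roughly like this (a sketch; the pool name data and the exact command sequence are my assumption, only the raidz2 layout over the five p3 partitions is from the article):

```
gnop create -S 4096 ada0p3 ada1p3 ada2p3 ada3p3 ada4p3
zpool create data raidz2 ada0p3.nop ada1p3.nop ada2p3.nop ada3p3.nop ada4p3.nop
zpool export data
gnop destroy ada0p3.nop ada1p3.nop ada2p3.nop ada3p3.nop ada4p3.nop
zpool import data
```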

Verifying that a pool is 4k-ready

To verify that the pool will use 4k sectors, have a look at the ashift values of the pool (the ashift is per vdev, so if you e.g. concatenate several mirrors, the ashift needs to be verified for each mirror, and if you concatenate just a bunch of disks, the ashift needs to be verified for every disk). It needs to be 12 (2^12 = 4096). To get the ashift value you can use zdb:

zdb rpool | grep ashift
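A small script can turn that check into a yes/no answer. In this sketch the zdb output is canned so it can run anywhere; on the real system replace the sample with the actual output of the zdb command above:

```shell
# Flag any vdev whose ashift is not 12 (2^12 = 4096-byte sectors).
sample='            ashift: 12
            ashift: 12'
bad=$(printf '%s\n' "$sample" | awk '$1 == "ashift:" && $2 != 12')
if [ -z "$bad" ]; then
  result="all vdevs are 4k-ready"
else
  result="WARNING: some vdev has ashift != 12"
fi
echo "$result"
```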

Setting up the root pool

One of the benefits of root-on-zfs is that I can have multiple FreeBSD boot environments (BEs). This means that I can have not only several different kernels, but also several different userland versions. To handle them comfortably, I use manageBE from Philipp Wuensche. This requires a specific setup of the root pool:

zfs create rpool/ROOT
zfs create rpool/ROOT/r220832M
zpool set bootfs=rpool/ROOT/r220832M rpool
zfs set freebsd:boot-environment=1 rpool/ROOT/r220832M   # manageBE setting

The r220832M is my initial BE; I use the SVN revision of the source tree which was used during the install of this BE as its name. You also need to add the following line to /boot/loader.conf so the ZFS module is loaded at boot:

zfs_load="YES"
As I want to have a shared /var and /tmp for all my BEs, I create them separately:

zfs create -o exec=off -o setuid=off -o mountpoint=/rpool/ROOT/r220832M/var rpool/var
zfs create -o setuid=off -o mountpoint=/rpool/ROOT/r220832M/tmp rpool/tmp

As I did this on the old system, I did not set the mountpoints to /var and /tmp yet; this has to be done later.

Now the userland can be installed (e.g. buildworld/buildkernel/installkernel/installworld/mergemaster with DESTDIR=/rpool/ROOT/r220832M/; do not forget to put a good master.passwd/passwd/group in the root pool).

When the root pool is ready, make sure an empty /etc/fstab is inside, and configure the system as follows (only showing what is necessary for root-on-zfs):

/etc/rc.conf:
---snip---
zfs_enable="YES"
---snip---

At this point of the setup I unmounted all ZFS filesystems on rpool, set the mountpoint of rpool/var to /var and of rpool/tmp to /tmp, exported the pool, and installed the harddisk in the new system. After booting a live USB image, importing the pool, and putting the resulting zpool.cache into the pool (rpool/ROOT/r220832M/boot/zfs/), I rebooted into the rpool and attached the other harddisks to the pool ("zpool attach rpool ada0p2 ada1p2", "zpool attach rpool ada0p2 ada2p2").

After updating to a more recent version of 9-current, the BE list now looks like this:

# ./bin/manageBE list
Poolname: rpool
BE                Active Active Mountpoint           Space
Name              Now    Reboot -                    Used
----              ------ ------ ----------           -----
r221295M          yes    yes    /                    2.66G
cannot open „-“: dataset does not exist
r221295M@r221295M no     no     -
r220832M          no     no     /rpool/ROOT/r220832M 561M

Used by BE snapshots: 561M

The little bug above (the error message, probably caused by the snapshot, which probably shows up here because I use listsnapshots=on) has already been reported to the author of manageBE.


13 thoughts on “Another root-on-zfs HOWTO (optimized for 4k-sector drives)”

  1. Hi, I think you have an error in your logic. You mention that you check that ashift for the pool is set to 12, but ashift is not a pool-wide setting. That is, you can create a pool with a single vdev initially and get ashift 12, but you can then add additional vdevs later and they will only have ashift 12 if they are detected to have 4k sectors (i.e. the default would be ashift 9).

    cheers Andy.

    1. I updated the ashift info in the article so it covers not only the use-case presented here, making it more useful for the generic case.

  2. Hi, I just installed a ZFS root mirror on CURRENT with 4k-sector drives before I saw your article. Before I tear down my setup: is the performance much improved over 512-byte sector sizes? Thanks for your posting. Great info.
    output from diskinfo /dev/ada0:
    512             # sectorsize
    2000398934016   # mediasize in bytes (1.8T)
    3907029168      # mediasize in sectors
    4096            # stripesize
    0               # stripeoffset
    3876021         # Cylinders according to firmware.
    16              # Heads according to firmware.
    63              # Sectors according to firmware.
    5YD2LLPS        # Disk ident.

    1. First check if your partitions are already aligned (manual calculation), and if the pool is using 4k sectors (see the zdb command in the article). From the output of diskinfo I have the impression you run a recent -current, which already should do the right thing at least in ZFS (for gpart you may need to specify the new -a option to align a partition correctly).

      And yes, if you really have a 4k-sector drive, there is a big speed difference between aligned and unaligned (and between 4k sectors in ZFS or not).

  3. Thanks for the info.
    I am using CURRENT as of June 12th.

    Unfortunately, I get 9 when I run zdb:
    zdb zroot | grep ashift
    ashift: 9
    ashift: 9
    I was wondering: can I use gpart to resize the disk with the -a option to correct any problems without reinstalling? Like get into single-user mode, detach the filesystem, gpart resize -a, and then mount the filesystem again. Would this work?

    1. Aligning a partition means moving the data inside it to a different place. I am not aware that gpart is able to move the data of a partition.

  4. it just reboots… To my understanding, pmbr should somehow call gptzfsboot, which should find zpool.cache, which contains the result of “zpool set bootfs=/rpool rpool”, mounts rpool and then starts /boot/kernel?! But it immediately reboots without a message, so what does not work? O dear god I’m such a noob… Enlighten me with your insight, master.

    1. Your description does not contain enough information to be able to help. Can you please describe on fs@FreeBSD.org what you did and in which order? There are more people there (with more time than I have) who should be able to help.

  5. I see you set the size of the boot partition to 512 kB (1024 blocks). You should be aware that the boot code actually loads the entire partition, so you want to keep it as small as possible. There’s not much point in aligning it, either, since it’s only read once, at boot time, and never written to after installation.
    If you start your boot partition at offset 34 (the first available block on a GPT disk) and give it 94 blocks (47 kB), the next partition will start at offset 128, which is a nice round number. If you absolutely must align the boot partition, you can place it at offset 36 with a length of 92 blocks (46 kB). The GPT ZFS second-stage loader, gptzfsboot, is only about 30 kB, so 92 blocks is plenty, even allowing for a reasonable amount of future bloat.

  6. HEAD gptzfsboot was apparently broken. I installed the one from a 9-CURRENT USB image, and now everything works as expected. Great performance. Thank you so much for your competent tutorial.

  7. Hi Alex,
    I’m in the process of rebuilding my NAS and have upgraded the disks to new WD Caviar Green with Adv. Format. One thing which I haven’t been able to find anywhere is the assurance that, when using gpart -b to specify the starting sector, the util actually does the (sector)-1 for drives where the LBA starts at zero and not 1 (as is the case for most drives).
    I’m assuming from your walkthrough that this is the case and that I can stop worrying about the 4k-sector alignment by specifying the starting sector in absolute terms and not the LBA itself?

  8. Gavin, I did not take into account the difference between drives which start at 0 resp. 1. I suggest using des’ tool (can’t remember its name), which tests the alignment of those drives, to be sure.
