Alexander Leidinger

Just another weblog

Sep
01

HOWTO add linux-infrastructure ports for a new linux_base port

In my last blog-post I described how to create a new linux_base port. This blog-post is about the other Linux-ports which make up the Linux-infrastructure in the FreeBSD Ports Collection for a given Linux-release.

What are linux-infrastructure ports?

A linux_base port contains as much as possible and at the same time as little as possible to make up a useful Linux-compatibility-experience in FreeBSD. I know, this is not a descriptive explanation, but the vagueness is not on purpose: there are no fixed rules for what has to be inside and what not. It “matured” into the current shape. A practical example is that there is no GUI-stuff in the linux_base. While you need the GUI parts like GTK or Qt for software like Skype and acroread, you do not need them for headless game servers. While you may need various libraries for game servers, you may not need those for Skype or acroread. As such some standard parts are in separate ports which are named linux-LINUX_DIST_SUFFIX-NAME. For GTK and the Fedora 10 release this results in linux-f10-gtk2. Such generic ports which depend upon a specific Linux-release make up the Linux-infrastructure in the FreeBSD Ports Collection. Those ports are referenced in port-Makefiles via the USE_LINUX_APPS variable, e.g. USE_LINUX_APPS=gtk2.

If you cre­ated a new linux_base port, you need most stan­dard infra­struc­ture ports in a ver­sion for the Linux-release used in the linux_base port, to have the Linux-application ports in the FreeBSD Ports Col­lec­tion work­ing (if you are unlucky, some ports do not play well with the Linux-release you have cho­sen, but this is out of the scope of this HOWTO).

Updat­ing Mk/bsd.linux-apps.mk

First we need to set the LINUX_DIST_SUFFIX variable to a value suitable for the new Linux-release. This is done in the conditional which checks the OVERRIDE_LINUX_NONBASE_PORTS variable for valid values. Add an appropriate branch to this conditional, and do not forget to add the new valid value to the IGNORE line in the last else branch of the conditional.

The next step is to check the _LINUX_APPS_ALL and _LINUX_26_APPS variables. If there are some infrastructure ports which are not available for the new Linux-release, the conditional which checks the availability of a given infrastructure port for a given Linux-release needs to be modified. If at a later step you notice that some additional infrastructure ports are necessary for the new Linux-release, _LINUX_APPS_ALL and the check-logic need to be modified too (e.g. add a new variable for your Linux-release, add the content of the variable to _LINUX_APPS_ALL, and change the check to do the right thing).

After that two tedious parts need to be done.

For each infrastructure port there is a set of variables. The name_PORT variable contains the location of the port in the Ports Collection. Typically you do not have to change it (if you really want to change it, do not do it, fix the naming of the infrastructure port instead), because we use a naming convention here which includes the LINUX_DIST_SUFFIX. The name_DETECT variable is an internal variable, do not change it (if you create a new infrastructure port, copy it from somewhere else and make sure the name in the value of the variable matches the port name in the name of the variable). Then there are several name_suffix_FILE variables. Leave the existing ones alone, and add a new one with the correct suffix for your new Linux-release. The value of the variable needs to be an important file which is installed by the infrastructure port in question, ideally a library in the port. FYI: depending on the Linux-release, the content of one of the name_suffix_FILE variables is used to set the name_DETECT variable, and the name_DETECT variable is then used to check if the port is already installed. The name_DEPENDS variable lists the dependencies of this infrastructure port. If the dependencies changed in your Linux-release, you need to add a conditional to change the dependency if LINUX_DIST_SUFFIX is set to your Linux-release.

Normally this is all that needs to be done in PORTSDIR/Mk/bsd.linux-apps.mk, the rest of the file is code to check dependencies and some correctness checks.

The second tedious part is to actually create all those infrastructure ports. Normally you can copy an existing infrastructure port, rename it, adjust the PORTNAME, PORTVERSION, PORTREVISION, MASTER_SITES, PKGNAMEPREFIX, DISTFILES, CONFLICTS (also in all other Linux-release versions of this infrastructure port), LINUX_DIST_VER, RPMVERSION (if set/necessary) and SRC_DISTFILE variables, generate the distfile checksums (make makesum), and fix the plist. I suggest scripting parts of this work (as of this writing Freshports counts 68 ports where the portname starts with linux-f10-).
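A minimal sketch of how the copy-and-rename step could be scripted. The function name clone_linux_port and the new suffix f15 are only examples, and the blind replacement of the old suffix in the Makefile is an assumption which can hit unrelated strings, so the result still has to be reviewed by hand (versions, MASTER_SITES, CONFLICTS, plist) before running make makesum.

```shell
#!/bin/sh
# clone_linux_port OLD_DIR OLD_SUFFIX NEW_SUFFIX  (hypothetical helper)
# Copies e.g. x11-toolkits/linux-f10-gtk2 to x11-toolkits/linux-f15-gtk2
# and replaces the old suffix inside the copied Makefile.
clone_linux_port() {
    old_dir=$1
    old_sfx=$2
    new_sfx=$3
    parent=$(dirname "$old_dir")
    old_name=$(basename "$old_dir")
    # Only names which start with linux-OLD_SUFFIX- are renamed.
    new_name=$(printf '%s\n' "$old_name" | sed "s/^linux-$old_sfx-/linux-$new_sfx-/")
    new_dir=$parent/$new_name
    [ "$new_name" != "$old_name" ] || { echo "not a linux-$old_sfx- port: $old_dir" >&2; return 1; }
    [ ! -e "$new_dir" ] || { echo "already exists: $new_dir" >&2; return 1; }
    cp -R "$old_dir" "$new_dir" || return 1
    # The old checksums are wrong for the new RPMs; regenerate with "make makesum".
    rm -f "$new_dir/distinfo"
    # Blind replacement of the suffix; review the result by hand.
    sed "s/$old_sfx/$new_sfx/g" "$new_dir/Makefile" > "$new_dir/Makefile.new" &&
        mv "$new_dir/Makefile.new" "$new_dir/Makefile" || return 1
    echo "$new_dir"
}
```

Called as “clone_linux_port /usr/ports/x11-toolkits/linux-f10-gtk2 f10 f15” it prints the path of the new port directory, which makes it easy to loop over all linux-f10-* ports.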

Adding new infra­struc­ture ports, or remov­ing infra­struc­ture ports for a given Linux-release

If your Linux-release does not come with a pack­age for an exist­ing infra­struc­ture port, just do not cre­ate a cor­re­spond­ing name_suf­fix_FILE line. You still need to do the right thing regard­ing depen­den­cies of ports which depend upon this non-existing infra­struc­ture port (if your Linux-release comes with pack­ages for them).

To add a new infra­struc­ture port, copy an exist­ing block, rename the vari­ables, set them cor­rectly, add a new vari­able for your Linux-release in the first _LINUX_APPS_ALL sec­tion, add the con­tent of this vari­able to _LINUX_APPS_ALL, and change the check-logic as described above.

Final words

If you have something which installs and deinstalls correctly, feel free to provide it on freebsd-emulation@FreeBSD.org for review/testing. If you have questions during the porting, also feel free to send a mail there.

Aug
29

HOWTO cre­ate a new linux_base port

FreeBSD is in need of a new linux_base port. It has been on my TODO list for a long time, but I do not get the time to create one. I still do not have the time to work on a new one, but when you read this, I managed to get the time to create a HOWTO which describes what needs to be done to create a new linux_base port.

I will not describe how to cre­ate a new linux_base port from scratch, I will just describe how you can copy the last one and update it to some­thing newer based upon the exist­ing infra­struc­ture for RPM packages.

Specific questions which come up during porting a new Linux release should be asked on freebsd-emulation@FreeBSD.org, there are more people who can answer questions there than here in my blog. I will add useful information to this HOWTO if necessary.

In the easy case most of the work is finding the right RPMs and their dependencies to use, and creating the plist.

Why do we need a new linux_base port?

The current linux_base port is based upon Fedora 10, which has been end of life since December 2009. Even Fedora 13 is already end of life. Fedora 16 is supposed to be released this year. From a support point of view, Fedora 15 or maybe even Fedora 16 would be a good target for the next linux_base port. Another alternative would be to use an extended lifetime release of another RPM based distribution, like for example CentOS 6 (which seems to be based upon Fedora 12 with backports from Fedora 13 and 14). Using a Linux release which is said to be supported for at least 10 years sounds nice from a FreeBSD point of view (only minor changes to the linux ports in such a case, instead of creating a completely new linux_base every N+2 releases like with Fedora), but it also means additional work if you want to create the first linux_base port for it.

The mys­ter­ies you have to con­quer if you want to cre­ate a new linux_base port

What we do not know is whether Fedora 15/16, CentOS 6, or any other Linux release will work in a supported FreeBSD release. There are two ways to find this out.

The first one is to take an existing Linux system, chroot into it (either via NFS or after making a copy into a directory of a FreeBSD system), and to run a lot of programs (acroread, skype, shells, scripts, …). The LTP testsuite is not very useful here, as it will test mostly kernel features, but we do not know which kernel features are mandatory for a given userland of a Linux release.

The sec­ond way of test­ing if a given Linux release works on FreeBSD is to actu­ally cre­ate a new linux_base port for it and test it with­out chrooting.

The first way is faster, if you are only inter­ested in test­ing if some­thing works. The sec­ond way pro­vides an easy to setup test­bed for FreeBSD ker­nel devel­op­ers to fix the Lin­ux­u­la­tor so that it works with the new linux_base port. Both ways have their mer­its, but it is up to the per­son doing the work to decide which way to go.

The meat: HOWTO cre­ate a new linux_base port

First off, you need a sys­tem (or a jail) with­out any linux_base port installed. After that you can cre­ate a new linux_base port (= lbN), by just mak­ing a copy of the lat­est one (= lbO). In lbN you need to add lbO as a CONFLICT, and in all other exist­ing linux_base ports, you need to add lbN as a conflict.

Change the PORTNAME, PORTVERSION, reset the PORTREVISION in lbN, and set LINUX_DIST_VER  to the new Linux-release ver­sion in the lbN Make­file (this is used in PORTSDIR/Mk/bsd.linux-rpm.mk and PORTSDIR/Mk/bsd.linux-apps.mk).

If you do not stay with Fedora, there is some more work to do before you can have a look at choosing RPMs for installation. You need to have a look at PORTSDIR/Mk/bsd.linux-rpm.mk and add some cases for the new LINUX_DIST you want to use. Do not forget to set LINUX_DIST in the lbN Makefile to the name of the distribution you use. You also need to augment the LINUX_DIST_VER check in PORTSDIR/Mk/bsd.linux-rpm.mk with some LINUX_DIST conditionals. If you are lucky, the directory structure for downloads is similar to the Fedora structure, and there is not a lot to do here.

When this is done, you can have a look at the BIN_DISTFILES variable in the lbN Makefile. Try to find similar RPMs for the new Linux release you want to port. Some may not be available, and it may also be the case that different ones are needed instead. I suggest working first with the ones which are available (make makesum, test install and create plist).

After that you need to find out what the replacement RPMs for the non-existing ones are. You are on your own here. Search around the net, and/or have a look at the dependencies in the RPMs of lbO to determine if something was added as a dependency of something else or not (if not, forget about it ATM). When you have managed to find replacement RPMs, you can have a look at the dependencies of the RPMs in lbN. Do not blindly add all dependencies, not all are needed in FreeBSD (the linux_base ports are not supposed to create an environment which you can chroot into, they are supposed to augment the FreeBSD system so that it is able to run Linux programs in ports as if they were FreeBSD native programs). What you need in the linux_base ports are libraries, config and data files which do not exist in FreeBSD or have a different syntax than in FreeBSD (those config or data files which are just in a different place can be symlinked), and basic shell commands (which commands are needed or not… well… good question, in the past we made decisions about what to include based upon problem reports from users).

Now for the things which are not available and were not added as a dependency. Those are things which are either used during install, or were useful to have in the past. Find out what each of them was replaced by and have a look if this replacement can easily be used instead. If it can be used, add it. If not, well… bad luck, we (the FreeBSD community) will see how to handle this somehow.

If you think that you have all you need in BIN_DISTFILES, please update SRC_DISTFILES accordingly and generate the checksums via “make -DPACKAGE_BUILDING makesum” so that the checksums of the sources are included too (for legal reasons we need the sources on our mirrors).

The next step is to have a look at REMOVE_DIRS, REMOVE_FILES and ADD_DIRS if some­thing needs to be mod­i­fied. Most of them are there to fall back to the cor­re­spond­ing FreeBSD directories/files, or because they are not needed at all (REMOVE_*). Do not remove direc­to­ries from ADD_DIRS, they are cre­ated here to fix some edge con­di­tions (I do not remem­ber exactly why we had to add them, and I do not take the time ATM to search in the CVS history).

If you are lucky, this is all (make sure the plist is cor­rect). If you are not lucky and you need to make some mod­i­fi­ca­tions to files, have a look at the do-build tar­get in the Make­file, this is the place where some changes are done to cre­ate a nice user experience.

If you arrive here while cre­at­ing a new linux_base port, lean back and feel a bit proud. You man­aged to cre­ate a new linux_base port. It is not very well tested at this moment, and it is far from every­thing which needs to be done to have the com­plete Linux infra­struc­ture for a given Linux release, but the most impor­tant part is done. Please notify freebsd-emulation@FreeBSD.org and call for testers.

What is missing?

The full Lin­ux­u­la­tor infra­struc­ture for the FreeBSD Ports Col­lec­tion has some more ports around a linux_base port. Most of the infra­struc­ture for this is han­dled in Mk/bsd.linux-apps.mk.

UPDATE: I got some time to write how to update the Linux-infrastructure ports.

May
19

How I setup a Jail-Host

Every­one has his own way of set­ting up a machine to serve as a host of mul­ti­ple jails. Here is my way, YMMV.

Ini­tial FreeBSD install

I use sev­eral hard­disks in a Soft­wareRAID setup. It does not mat­ter much if you set them up with one big par­ti­tion or with sev­eral par­ti­tions, feel free to fol­low your pref­er­ences here. My way of par­ti­tion­ing the hard­disks is described in a pre­vi­ous post. That post only shows the com­mands to split the hard­disks into two par­ti­tions and use ZFS for the rootfs. The com­mands to ini­tial­ize the ZFS data par­ti­tion are not described, but you should be able to fig­ure it out your­self (and you can decide on your own what kind of RAID level you want to use). For this FS I set atime, exec and setuid to off in the ZFS options.

On the ZFS data partition I create a new dataset for the system. For this dataset I set atime, exec and setuid to off in the ZFS options. Inside this dataset I create datasets for /home, /usr/compat, /usr/local, /usr/obj, /usr/ports/, /usr/src, /usr/sup and /var/ports. There are two ways of doing this. One way is to set the ZFS mountpoint. The way I prefer is to set relative symlinks to it, e.g. “cd /usr; ln -s ../data/system/usr_obj obj”. I do this because this way I can temporarily import the pool on another machine (e.g. my desktop, if the need arises) without fear of interfering with that system. The ZFS options are set as follows:

ZFS options for data/system/*

Dataset                 Option          Value
data/system/home        exec            on
data/system/usr_compat  exec            on
data/system/usr_compat  setuid          on
data/system/usr_local   exec            on
data/system/usr_local   setuid          on
data/system/usr_obj     exec            on
data/system/usr_ports   exec            on
data/system/usr_ports   setuid          on
data/system/usr_src     exec            on
data/system/usr_sup     secondarycache  none
data/system/var_ports   exec            on

The exec option for home is not nec­es­sary if you keep sep­a­rate datasets for each user. Nor­mally I keep sep­a­rate datasets for home direc­to­ries, but Jail-Hosts should not have users (except the admins, but they should not keep data in their homes), so I just cre­ate a sin­gle home dataset. The setuid option for the usr_ports should not be nec­es­sary if you redi­rect the build direc­tory of the ports to a dif­fer­ent place (WRKDIRPREFIX in /etc/make.conf).
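The dataset layout and the relative symlinks described above can be sketched as a small generator which only prints the commands instead of running them. The helper name emit_dataset_cmds and the input format (dataset name, directory of the symlink, name of the symlink, ZFS options) are my own choice for this illustration, and it assumes that the pool is mounted at /data so that the datasets show up below /data/system.

```shell
#!/bin/sh
# Print (not execute) the "zfs create" and "ln -s" commands for each
# input line of the form: NAME LINKDIR LINKNAME OPTION...
emit_dataset_cmds() {
    while read -r name dir link opts; do
        [ -n "$name" ] || continue
        cmd="zfs create"
        for o in $opts; do
            cmd="$cmd -o $o"
        done
        echo "$cmd data/system/$name"
        # Relative link target, one "../" for /usr and /var, none for /.
        case $dir in
            /) prefix=data/system ;;
            *) prefix=../data/system ;;
        esac
        echo "(cd $dir && ln -s $prefix/$name $link)"
    done
}

emit_dataset_cmds <<'EOF'
home        /     home    exec=on
usr_compat  /usr  compat  exec=on setuid=on
usr_local   /usr  local   exec=on setuid=on
usr_obj     /usr  obj     exec=on
usr_ports   /usr  ports   exec=on setuid=on
usr_src     /usr  src     exec=on
usr_sup     /usr  sup     secondarycache=none
var_ports   /var  ports   exec=on
EOF
```

Printing the commands first makes it possible to review them (or to pipe them into sh) before anything is changed on the pool.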

Installing ports

The ports I install by default are net/rsync, ports-mgmt/portaudit, ports-mgmt/portmaster, shells/zsh, sysutils/bsdstats, sysutils/ezjail, sysutils/smartmontools and sysutils/tmux.

Basic setup

In the crontab of root I set up a job to do a portsnap update once a day (I pick a random number between 0 and 59 for the minute, but keep a fixed hour). I also have http_proxy specified in /etc/profile, so that all machines in this network do not download everything from far away again and again, but can get the data from the local caching proxy. As a small watchdog I have an @reboot rule in the crontab, which notifies me when a machine reboots:

@reboot grep "kernel boot file is" /var/log/messages | mail -s "`hostname` rebooted" root >/dev/null 2>&1

This does not replace a real mon­i­tor­ing solu­tion, but in cases where real mon­i­tor­ing is overkill it pro­vides a nice HEADS-UP (and shows you directly which ker­nel is loaded in case a non-default one is used).
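For the daily portsnap job mentioned above, the crontab line with a random minute could be generated like this. This is only a sketch: the function name portsnap_cron_line and the hour 3 are arbitrary, and the use of “portsnap cron update” (the cron variant of fetch, followed by update) is my assumption about what the job should run.

```shell
#!/bin/sh
# Print a root crontab line with a random minute (0-59) and a fixed hour.
portsnap_cron_line() {
    minute=$(awk 'BEGIN { srand(); print int(rand() * 60) }')
    printf '%d 3 * * * /usr/sbin/portsnap cron update >/dev/null 2>&1\n' "$minute"
}

portsnap_cron_line
```

The minute is picked once when the line is generated, so it differs between machines but stays stable on each of them.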

Some default aliases I use every­where are:

alias portmlist="portmaster -L | egrep -B1 '(ew|ort) version|Aborting|installed|dependencies|IGNORE|marked|Reason:|MOVED|deleted|exist|update' | grep -v '^--'"
alias portmclean="portmaster -t --clean-distfiles --clean-packages"
alias portmcheck="portmaster -y --check-depends"

Addi­tional devfs rules for Jails

I need to give access to some specific devices in some jails. For this I need to set up a custom /etc/devfs.rules file. The file contains some ID numbers which need to be unique in the system. On a 9-current system the numbers one to four are already used (see /etc/defaults/devfs.rules). The next available number is obviously five then. First I present my devfs.rules entries, then I explain them:

[devfsrules_unhide_audio=5]
add path 'audio*' unhide
add path 'dsp*' unhide
add path midistat unhide
add path 'mixer*' unhide
add path 'music*' unhide
add path 'sequencer*' unhide
add path sndstat unhide
add path speaker unhide

[devfsrules_unhide_printers=6]
add path 'lpt*' unhide
add path 'ulpt*' unhide user 193 group 193
add path 'unlpt*' unhide user 193 group 193

[devfsrules_unhide_zfs=7]
add path zfs unhide

[devfsrules_jail_printserver=8]
add include $devfsrules_hide_all
add include $devfsrules_unhide_basic
add include $devfsrules_unhide_login
add include $devfsrules_unhide_printers
add include $devfsrules_unhide_zfs

[devfsrules_jail_withzfs=9]
add include $devfsrules_hide_all
add include $devfsrules_unhide_basic
add include $devfsrules_unhide_login
add include $devfsrules_unhide_zfs

The devfsrules_unhide_XXX ones give access to specific devices, e.g. all the sound related devices or local printers. The devfsrules_jail_XXX ones combine all the unhide rules for specific jail setups. Unfortunately the include directive is not recursive, so we cannot include the default devfsrules_jail profile and need to replicate its contents. The first three includes of each devfsrules_jail_XXX accomplish this. The unhide_zfs rule gives access to /dev/zfs, which is needed if you attach one or more ZFS datasets to a jail. I will explain how to use those profiles with ezjail in a follow-up post.

Jails setup

I use ezjail to manage jails, it is more comfortable than doing it by hand while at the same time it still allows me to do something by hand. My jails normally reside inside ZFS datasets, for this reason I have set up a special area (ZFS dataset data/jails) which is handled by ezjail. The corresponding ezjail.conf settings are:

ezjail_jaildir=/data/jails
ezjail_use_zfs="YES"
ezjail_jailzfs="data/jails"

I also dis­abled procfs and fde­scfs in jails (but they can be enabled later for spe­cific jails if necessary).

Unfortunately ezjail (as of v3.1) sets the mountpoint of a newly created dataset even if it is not necessary. For this reason I always issue a “zfs inherit mountpoint ” on the new dataset after creating a jail. This simplifies the case where you want to move/rename a dataset and want to have the mountpoint automatically follow the change.

The access flags of the /data/jails directory are 700, this prevents local users (there should be none, but better safe than sorry) from getting access to files of users in jails with the same UID.

After the first create/update of the ezjail basejail the ZFS options of basejail (data/jails/basejail) and newjail (data/jails/newjail) need to be changed. For both, exec and setuid should be changed to “on”. The same needs to be done for each new jail after creating it (before starting it).

The default ezjail flavour

In my default ezjail flavour I create some default user(s) with a basesystem-shell (via /data/jails/flavours/mydef/ezjail.flavour) before the package install, and change the shell to my preferred zsh afterwards (this is only valid if the jails are used only by in-house people, if you want to offer lightweight virtual machines to (unknown) customers, the default user(s) and shell(s) are obviously open to discussion). At the end I also run a “/usr/local/sbin/portmaster -y --check-depends” to make sure everything is in a sane state.

For the pack­ages (/data/jails/flavours/mydef/pkg/) I add sym­links to the unver­sioned pack­ages I want to install. I have the pack­ages in a com­mon (think about set­ting PACKAGES in make.conf and using PACKAGES/Latest/XYZ.tbz) direc­tory (if they can be shared over var­i­ous flavours), and they are unver­sioned so that I do not have to update the ver­sion num­ber each time there is an update. The pack­ages I install by default are bsd­stats, por­tau­dit, port­mas­ter, zsh, tmux and all their dependencies.

In case you use jails to virtualize services and consolidate servers (e.g. DNS, HTTP, MySQL each in a separate jail) instead of providing lightweight virtual machines to (unknown) customers, there is also a benefit in sharing the distfiles and packages between jails on the same machine. To do this I create /data/jails/flavours/mydef/shared/ports/{distfiles,packages} which are then mounted via nullfs or NFS into all the jails from a common directory. This requires the following variables in /data/jails/flavours/mydef/etc/make.conf (I also keep the packages for different CPU types and compilers in the same subtree, if you do not care, just remove the “/${CC}/${CPUTYPE}” from the PACKAGES line):

DISTDIR=  /shared/ports/distfiles
PACKAGES= /shared/ports/packages/${CC}/${CPUTYPE}

New jails

A future post will cover how I set up new jails in such a setup and how I customize the start order of jails or use some non-default settings for the jail-startup.

May
03

Another root-on-zfs HOWTO (opti­mized for 4k-sector drives)

After 9 years with my current home-server (one jail for each service like MySQL, Squid, IMAP, Webmail, …) I decided that it is time to get something more recent (especially as I want to install some more jails but cannot add more memory to this i386 system).

With my old sys­tem I had an UFS2-root on a 3-way-gmirror, swap on a 2-way-gmirror and my data in a 3-partition raidz (all in dif­fer­ent slices of the same 3 hard­disks, the 3rd slice which would cor­re­spond to the swap was used as a crash­dump area).

For the new system I wanted to go all-ZFS, but I like to have my boot area separated from my data area (two pools instead of one big pool). As the machine has 12 GB RAM I also do not configure swap areas (at least by default, if I really need some swap I can add some later, see below). The system has five 1 TB harddisks and a 60 GB SSD. The harddisks do not have 4k-sectors, but I expect that there will be more and more 4k-sector drives in the future. As I prefer to plan ahead I installed the ZFS pools in a way that they are “4k-ready”. For those who have 4k-sector drives which do not tell the truth but announce that they have 512 byte sectors (I will call them pseudo-4k-sector drives here) I include a description of how to properly align the (GPT-)partitions.

A major require­ment to boot 4k-sector-size ZFS pools is ZFS v28 (to be cor­rect here, just the boot-code needs to sup­port this, so if you take the pmbr and gptzfs­boot from a ZFS v28 sys­tem, this should work… but I have not tested this). As I am run­ning 9-current, this is not an issue for me.

A quick description of the task is to align the partitions/slices properly for pseudo-4k-sector drives, and then use gnop temporarily during pool creation time to have ZFS use 4k-sectors during the lifetime of the pool. The long description follows.

The lay­out of the drives

The five equal drives are partitioned with a GUID partition table (GPT). Each drive is divided into three partitions, one for the boot code, one for the root pool, and one for the data pool. The root pool is a 3-way mirror and the data pool is a raidz2 pool over all 5 disks. The two harddisks which do not take part in the mirroring of the root pool get swap partitions of the same size as the root partitions instead. One of them is used as a dumpdevice (this is -current, after all), and the other one stays unused as a cold-standby. The 60 GB SSD will be used as a ZFS cache device, but as of this writing I have not decided yet if I will use it for both pools or only for the data pool.

Cal­cu­lat­ing the offsets

The first sec­tor after the GPT (cre­ated with stan­dard set­tings) which can be used as the first sec­tor for a par­ti­tion is sec­tor 34 on a 512 bytes-per-sector drive. On a pseudo-4k-sector drive this would be some­where in the sec­tor 4 of a real 4k-sector, so this is not a good start­ing point. The next 4k-aligned sec­tor on a pseudo-4k-sector drive is sec­tor 40 (sec­tor 5 on a real 4k-sector drive).
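The calculation above is just rounding a 512-byte sector number up to the next multiple of 8 (8 × 512 bytes = 4096 bytes). A minimal sketch in shell arithmetic, where the function name align_up_4k is my own:

```shell
#!/bin/sh
# Round a 512-byte sector number up to the next 4k boundary,
# i.e. to the next multiple of 8 sectors.
align_up_4k() {
    echo $(( ($1 + 7) / 8 * 8 ))
}

align_up_4k 34     # first usable sector after the GPT, prints 40
align_up_4k 2048   # already aligned, prints 2048
```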

The first par­ti­tion is the par­ti­tion for the FreeBSD boot code. It needs to have enough space for gptzfs­boot. Only allo­cat­ing the space needed for gptzfs­boot looks a lit­tle bit dan­ger­ous regard­ing future updates, so my hard­disks are con­fig­ured to allo­cate half a megabyte for it. Addi­tion­ally I leave some unused sec­tors as a safety mar­gin after this first partition.

The second partition is the root pool (respectively the swap partition on the other two disks). I let it start at sector 2048, which would be sector 256 on a real 4k-sector drive (if you do not want to waste the nearly half a megabyte in front of it, just calculate a lower start sector which is divisible by 8 (-> start % 8 = 0)). It is a 4 GB partition, this is enough for the basesystem with some debug kernels. Everything else (/usr/{src,ports,obj,local}) will be in the data partition.

The last partition is directly after the second and uses the rest of the harddisk rounded down to a full GB (if the disk needs to be replaced with a similar sized disk there is some safety margin left, as the number of sectors in harddisks fluctuates a little bit even in the same models from the same manufacturing batch). For my harddisks this means a little bit more than half a gigabyte of wasted storage space.

The com­mands to par­ti­tion the disks

In the following I use ada0 as the device of the disk, but it also works with daX or adX or similar. I installed one disk from an existing 9-current system instead of using some kind of installation media (beware, the pool is linked to the system which creates it, I booted a live USB image to import it on the new system and copied the zpool.cache to /boot/zfs/ after importing on the new system).

Cre­ate the GPT:

gpart create -s gpt ada0

Cre­ate the boot partition:

gpart add -b 40 -s 1024 -t freebsd-boot ada0

Cre­ate the root/swap par­ti­tions and name them with a GPT label:

gpart add -b 2048 -s 4G -t freebsd-zfs -l rpool0 ada0

or for the swap

gpart add -b 2048 -s 4G -t freebsd-swap -l swap0 ada0

Create the data partition and name it with a GPT label:

gpart add -s 927G -t freebsd-zfs -l data0 ada0

Install the boot code in par­ti­tion 1:

gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0

The result looks like this:

# gpart show ada0
=>        34  1953525101  ada0  GPT  (931G)
          34           6        - free -  (3.0k)
          40        1024     1  freebsd-boot  (512k)
        1064         984        - free -  (492k)
        2048     8388608     2  freebsd-zfs  (4.0G)
     8390656  1944059904     3  freebsd-zfs  (927G)
  1952450560     1074575        - free -  (524M)
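The 927G of the data partition can be cross-checked from the numbers in this output: take the sectors between the start of the data partition and the end of the usable area and round down to full gigabytes (2097152 sectors of 512 bytes each). A small sketch, where the function name data_size_gib is my own:

```shell
#!/bin/sh
# data_size_gib FIRST_USABLE USABLE_SECTORS DATA_START
# Size of the data partition in full GiB, rounded down.
data_size_gib() {
    end=$(( $1 + $2 ))                  # first sector past the usable area
    echo $(( (end - $3) / 2097152 ))    # 2097152 * 512 bytes = 1 GiB
}

data_size_gib 34 1953525101 8390656    # prints 927
```

The remainder of this division, 1074575 sectors, is exactly the free space of 524M shown in the last line of the gpart output.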

Cre­ate the pools with 4k-ready inter­nal structures

Creating a ZFS pool on one of the ZFS partitions without preparation will not create a 4k-ready pool on a pseudo-4k-drive. I used gnop (the settings do not survive a reboot) to temporarily make the partition a 4k-sector partition (only the commands for the root pool are shown, for the data partition gnop has to be used in the same way):

gnop create -S 4096 ada0p2
zpool create -O utf8only=on -o failmode=panic rpool ada0p2.nop
zpool export rpool
gnop destroy ada0p2.nop
zpool import rpool

After the pool is created, it will keep the 4k-sectors setting, even when accessed without gnop. You can ignore the options I used to create the pool, they are just my preferences (and the utf8only setting can only be done at pool creation time). If you prepare this on a system which already has a zpool of its own, you can maybe specify “-o cachefile=/boot/zfs/zpool2.cache” and copy it to the new pool as zpool.cache to make it bootable without the need for a live image for the new system (I did not test this).

Ver­i­fy­ing if a pool is pseudo-4k-ready

To verify that the pool will use 4k-sectors, you can have a look at the ashift values of the pool (the ashift is per vdev, so if you e.g. concatenate several mirrors, the ashift needs to be verified for each mirror, and if you concatenate just a bunch of disks, the ashift needs to be verified for all disks). It needs to be 12 (2^12 = 4096 bytes). To get the ashift value you can use zdb:

zdb rpool | grep ashift
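With several vdevs it is easy to overlook one wrong value, so the check can be wrapped into a small filter. This is a sketch which assumes that zdb prints the value as “ashift: N”, one line per vdev; the function name all_ashift_12 is my own.

```shell
#!/bin/sh
# Read zdb output on stdin; succeed only if at least one "ashift: N"
# line was seen and every one of them reports 12.
all_ashift_12() {
    awk '$1 == "ashift:" { n++; if ($2 != 12) bad++ }
         END { exit (n > 0 && bad == 0) ? 0 : 1 }'
}

# Usage on a live system:
#   zdb rpool | all_ashift_12 && echo "rpool is 4k-ready"
```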

Set­ting up the root pool

One of the ben­e­fits of root-on-zfs is that I can have mul­ti­ple FreeBSD boot envi­ron­ments (BE). This means that I not only can have sev­eral dif­fer­ent ker­nels, but also sev­eral dif­fer­ent user­land ver­sions. To han­dle them com­fort­ably, I use man­ageBE from Philipp Wuen­sche. This requires a spe­cific setup of the root pool:

zfs create rpool/ROOT
zfs create rpool/ROOT/r220832M
zpool set bootfs=rpool/ROOT/r220832M rpool
zfs set freebsd:boot-environment=1 rpool/ROOT/r220832M   # manageBE setting

The r220832M is my ini­tial BE. I use the SVN revi­sion of the source tree which was used dur­ing install of this BE as the name of the BE here. You also need to add the fol­low­ing line to /boot/loader.conf:

vfs.root.mountfrom="zfs:rpool/ROOT/r220832M"

As I want to have a shared /var and /tmp for all my BEs, I cre­ate them separately:

zfs create -o exec=off -o setuid=off -o mountpoint=/rpool/ROOT/r220832M/var rpool/var
zfs create -o setuid=off -o mountpoint=/rpool/ROOT/r220832M/tmp rpool/tmp

As I did this on the old sys­tem, I did not set the mount­points to /var and /tmp, but this has to be done later.

Now the userland can be installed (e.g. buildworld/installworld/buildkernel/installkernel/mergemaster with DESTDIR=/rpool/ROOT/r220832M/, do not forget to put a good master.passwd/passwd/group in the root pool).

When the root pool is ready make sure an empty /etc/fstab is inside, and con­fig­ure the root as fol­lows (only show­ing what is nec­es­sary for root-on-zfs):

loader.conf:
---snip---
vfs.root.mountfrom="zfs:rpool/ROOT/r220832M"
zfs_load="yes"
opensolaris_load="yes"
---snip---

rc.conf:
---snip---
zfs_enable="YES"
---snip---

At this point of the setup I unmounted all ZFS datasets on rpool, set the mountpoint of rpool/var to /var and of rpool/tmp to /tmp, exported the pool and installed the harddisk in the new system. After booting a live USB image, importing the pool and putting the resulting zpool.cache into the pool (rpool/ROOT/r220832M/boot/zfs/), I rebooted into the rpool and attached the other harddisks to the pool ("zpool attach rpool ada0p2 ada1p2", "zpool attach rpool ada0p2 ada2p2").
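
The part of these steps that happens on the old system (unmount, change the mountpoints, export) could be sketched like this. It is not the exact sequence I typed back then: the helper names run and prepare_export are made up, and by default the sketch only prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch only: the commands are printed unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

# prepare_export POOL BE (hypothetical helper)
prepare_export() {
    pool=$1 be=$2
    # Unmount the datasets mounted below the BE first, then the parents,
    # so that changing the mountpoints does not mount anything over the
    # running system.
    for fs in "$pool/var" "$pool/tmp" "$pool/ROOT/$be" "$pool/ROOT" "$pool"; do
        run zfs unmount "$fs"
    done
    # Final mountpoints of the datasets shared by all BEs.
    run zfs set mountpoint=/var "$pool/var"
    run zfs set mountpoint=/tmp "$pool/tmp"
    run zpool export "$pool"
}

# Usage: prepare_export rpool r220832M
```

Review the printed commands first, and only then run it again with DRY_RUN=0.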

After updat­ing to a more recent ver­sion of 9-current, the BE looks like this now:

# ./bin/manageBE list
Poolname: rpool
BE                Active Active Mountpoint           Space
Name              Now    Reboot -                    Used
----              ------ ------ ----------           -----
r221295M          yes    yes    /                    2.66G
cannot open '-': dataset does not exist
r221295M@r221295M no     no     -
r220832M          no     no     /rpool/ROOT/r220832M  561M

Used by BE snapshots: 561M

The little bug above (the "cannot open" error message, probably triggered by the snapshot in the list, which in turn probably shows up because I use listsnapshots=on) is already reported to the author of manageBE.

Apr
19

Solaris UFS full while df shows plenty of free space/inodes

At work we have a Solaris 8 system with a UFS which told the application that it cannot create new files. The df command showed plenty of free inodes, and there was also enough free space in the FS. The reason the application got the error was that while there were still plenty of fragments free, no free block was available anymore. You cannot create a new file with fragments only, you need at least one free block for each new file.

To see the number of free blocks of a UFS you can call "fstyp -v <raw device of the FS> | head -18" and look at the value behind "nbfree".
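
If you want to feed the value into a monitoring script instead of reading it by eye, a small filter helps. This is only a sketch: the function name sb_value is made up, and it assumes that the superblock dump consists of whitespace-separated "name value" pairs, as in the sample line in the comment (the numbers there are invented).

```shell
#!/bin/sh
# sb_value KEY (hypothetical helper): print the value that follows KEY
# in `fstyp -v` output read from stdin, e.g. from a line like
#   nbfree  0       ndir    1200    nifree  250000  nffree  48211
# Exit status 1 if KEY does not show up.
# Note: on Solaris use nawk or /usr/xpg4/bin/awk, the old /usr/bin/awk
# does not know -v.
sb_value() {
    awk -v key="$1" '
        {
            for (i = 1; i < NF; i++)
                if ($i == key) { print $(i + 1); found = 1; exit }
        }
        END { if (!found) exit 1 }
    '
}

# Usage (the device is a placeholder):
#   fstyp -v <raw device of the FS> | sb_value nbfree
#   fstyp -v <raw device of the FS> | sb_value nffree
```

Comparing nbfree (free blocks) with nffree (free fragments) shows the situation described above: a lot of free fragments, but no free block.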

To get this working again we cleaned up the FS a little bit (compressing/deleting log files), but this is only a temporary solution. Unfortunately we cannot move this application to a Solaris 10 with ZFS, so I was playing around a little bit to see what we can do.

First I made a histogram of the file sizes. The backup of the FS I was playing with had a little bit more than 4 million files. 28.5% of them were smaller than or equal to 512 bytes, 31.7% were smaller than or equal to 1k (fragment size), 36% smaller than or equal to 8k (block size) and 74% smaller than or equal to 16k. The following graph shows in red the critical part: files which need a block and produce fragments, but cannot live with fragments only.
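
Such cumulative percentages can be produced with a few lines of awk. The following is a sketch and not the script I used: the function name size_histogram is made up, and the limits are hardcoded to the four sizes mentioned above. It expects one file size in bytes per line on stdin.

```shell
#!/bin/sh
# size_histogram (hypothetical helper): read file sizes in bytes, one per
# line, and print for each limit the percentage of files which are
# smaller than or equal to it (cumulative).
size_histogram() {
    awk '
        BEGIN { n = split("512 1024 8192 16384", limit, " ") }
        {
            total++
            for (i = 1; i <= n; i++)
                if ($1 <= limit[i]) count[i]++
        }
        END {
            if (total == 0) exit 1
            for (i = 1; i <= n; i++)
                printf "<=%d bytes: %.1f%%\n", limit[i], 100 * count[i] / total
        }
    '
}

# Usage (the path is a placeholder, the size is column 5 of "ls -ln"):
#   find /path/to/fs -type f -exec ls -ln {} + | awk '{ print $5 }' | size_histogram
```

For the four sizes 100, 512, 1000 and 9000 this prints 50.0%, 75.0%, 75.0% and 100.0% for the four limits.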

chart

Then I played around with newfs options for this one specific FS with this specific data mix. Changing the number of inodes did not change the outcome for our problem much (as expected). Changing the optimization from "time" to "space" (and restoring all the data from backup into the empty FS) gave us 1000 more free blocks. On a FS which had 10 million free blocks when empty this is not much, but we expect that the restore consumes fewer fragments and more full blocks than the live FS of the application (we cannot compare, as the content of the live FS changed a lot since we had the problem). We assume that e.g. the logs of the application are split over a lot of fragments instead of full blocks, due to small writes to the logs by the application. The restore should write all the data in big chunks, so it should use more full blocks and fewer fragments. Because of this we expect that the live FS with this specific data mix could benefit from changing the optimization.

I also played around with the frag­ment size. The expec­ta­tion was that it will only change what is reported in the out­put of df (reduc­ing the reported avail­able space for the same amount of data). Here is the result:

chart

The difference between 1k (default) and 2k is not much. For 8k we would lose too much space as unused. The fragment size of 4k looks like it is acceptable to get a better monitoring status of this particular data mix.
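
The reason why a bigger fragment size reduces the reported free space for the same data is simple rounding: the data of a small file is rounded up to a multiple of the fragment size. A rough sketch of the arithmetic follows. The function name alloc_size is made up, and the model is a simplification which only holds for small files (it ignores indirect blocks and the FS metadata).

```shell
#!/bin/sh
# alloc_size SIZE FRAG (hypothetical helper): bytes occupied by the data
# of a SIZE-byte file when space is handed out in units of FRAG bytes.
# Simplified model for small files, metadata and indirect blocks ignored.
alloc_size() {
    size=$1 frag=$2
    echo $(( ($size + $frag - 1) / $frag * $frag ))
}

# Usage: a 100 byte file with 1k and with 4k fragments
#   alloc_size 100 1024
#   alloc_size 100 4096
```

With our data mix, where nearly a third of the files is not bigger than 1k, each of those files occupies 4k instead of 1k with a fragment size of 4k. This is the space df no longer reports as available.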

Based upon this we will probably create a new FS with a fragment size of 4k and switch the optimization directly to "space". This way we will get better reporting of the fill level of the FS for our data mix (but we will not be able to fully use the real space of the FS), and as such our monitoring should alert us in time to do a cleanup of the FS or to increase its size.
