FreeBSD Service Jails – another layer in the security onion

In May I committed a new feature to FreeBSD-current (it will be in the FreeBSD 15 release, I have no plans to merge this to 14). This feature is called “Service Jails”. When you enable it, it takes a service (something which is started by an rc-script at boot or by hand via service(8)) and starts it in a jail(8). It can do this with any service, and with no more than 2 lines of configuration.

For those who don’t know, a jail is a kind of container technology. We have had this technology since 1999 (so it pre-dates Docker by 14 years). It served as an inspiration for Solaris zones.

Too good to be true?

Containerizing some software with only 2 lines of configuration (if a service is “Service Jail ready”, only one line) sounds amazing, and is in no way comparable to a Dockerfile or a normal jail config. This sounds a bit too good to be true. And that is correct. The Service Jails framework is somewhere between a fully isolated container and no containerization at all.

The biggest difference is that a Service Jail has full access to the entire filesystem of the host (or parent jail), except for chflags(1). This means if your service runs as root in the Service Jail and is compromised, the attacker is able to read your password database and modify nearly any file content (and as such, on the next boot anything can happen). The only exception is a service which already has provisions to run in a chroot and has this enabled; in that case only files inside the chroot can be modified by an attacker.

What are the benefits?

Compared to running a service on the host itself, without putting it into a jail you have created yourself and tailored to only the software you need, you get the benefit of limiting what the software (or an intruder) is able to do, but not the benefit of a minimal software install.

When you enable a Service Jail for a particular service which is not Service Jail ready and you do not provide a Service Jail config, the service is started inside a jail without any network access but with full access to the filesystem. A Service Jail ready service which normally needs network access can be limited by a custom config to not have any network access at all (or only IPv6 and not IPv4, or vice versa). This means you can prevent this software from accessing the network despite not having it inside a VM.

A Service Jail also doesn’t allow the service to:

  • mount filesystems (and on purpose there is no provision so far to optionally allow this),
  • open raw sockets (can be enabled),
  • open sockets of protocol stacks that have not had jail functionality added to them (IPv4, IPv6, local UNIX sockets and routing stuff are jail-aware) (can be enabled),
  • lock/unlock physical pages in memory (can be enabled),
  • use System V IPC facilities (can be enabled),
  • use debugging facilities for unprivileged processes,
  • see processes from the host or other jails,

and all the other stuff which is prohibited in jails by default.

When you enable network access (IPv4 and/or IPv6) for a particular service, the Service Jail inherits all the IPs of the host (or parent jail). This means you can not run two services which want to listen on the same port this way. But as you can limit a service to only IPv4 and not IPv6 (and vice versa), you could run two different services for IPv4 and IPv6, or you can test scenarios where only one IP stack is available while the host itself is configured for dual-stack network access.
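As a hedged sketch of such a per-service network restriction (the service names are placeholders, and the exact option keywords should be checked against the rc.conf(5) man page under “svcj”), splitting two daemons across the IP stacks could look like this in /etc/rc.conf:

```
# Hypothetical example: one daemon serves IPv4 only, the other IPv6 only,
# so both can listen on the same port on a dual-stack host.
mydaemon4_svcj="YES"
mydaemon4_svcj_options="netv4"
mydaemon6_svcj="YES"
mydaemon6_svcj_options="netv6"
```

Leaving out the `_svcj_options` line entirely gives the default of no network access at all.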

How to get the most out of Service Jails?

With the possibility to allow unprivileged users to open privileged ports (sysctl net.inet.ip.portrange.reservedhigh=0) and having the service started as non-root (sysrc servicename_user=MyServiceUser), a Service Jail provides a very good benefit for a simple one-line config change (sysrc servicename_svcj=YES).
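Put together as a sketch (the service and user names are placeholders; these are FreeBSD-specific configuration commands, not portable shell), the three changes just mentioned amount to:

```
# Allow unprivileged users to bind to privileged ports
# (add to /etc/sysctl.conf to make it persistent across reboots).
sysctl net.inet.ip.portrange.reservedhigh=0

# Run the service as a dedicated non-root user, inside a Service Jail.
sysrc mydaemon_user=MyServiceUser
sysrc mydaemon_svcj=YES

service mydaemon restart
```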

In this case filesystem access is restricted to what this particular user is able to read/write, only processes started by the service are visible to the service, and all the other jail restrictions apply. An intruder may as such do bad things to this particular service, but not to other services on the system.

For a read-only webserver this may mean an attacker may be able to modify some log files, but can not see other processes running on the system and deduce from them the most valuable next step in the attack.

For a read-only php-fpm service it may mean that the attacker can run some in-memory code to spawn a botnet, but not compromise other parts of the host or access System V memory locations of a database (if the php-fpm service is not configured to allow access to System V resources).

Further reading

The rc.conf(5) man page contains more info about what can be enabled for Service Jails (search for “svcj” and “SERVICE JAILS”). The rc scripting article explains how to make a service Service Jails ready, and the FreeBSD handbook contains a section on how to enable and configure Service Jails.

What’s next?

The base system services are either made Service Jails aware, or configured to not run inside a Service Jail (e.g. running fsck doesn’t make sense in a jail). Not all of the services are tested with Service Jails. Give them a try and send a bug report in case something doesn’t work.

The FreeBSD ports collection has about 1500 services. I have already committed some patches or sent patches to the maintainers for some of the high-profile ports (like webservers, databases, DNS servers, …) to make some of them Service Jails ready, but there are too many services to do them all myself. Feel free to submit some patches for them.

Plotting the FreeBSD memory fragmentation – part 2

If you haven’t read part 1 already, please do so; otherwise you will not understand what this is about (I don’t repeat the basics here).

The following graphs show the FMFI with D45043, D45045 and D45046 applied.

When you look at the graphs, keep in mind that I updated FreeBSD on 2024-05-27-120546 and 2024-06-04-105830. None of those updates introduced changes in the memory allocation area, so the results should be somewhat comparable.

I used the same workloads as in part 1 (not a deterministic benchmark, but a real-world use case with 30 jails and various package build runs).

First, the second-to-last graph from part 1, to have something to compare against:

Now with the 3 changes listed above:

Just by looking at the graphs, and given that I don’t run a fixed benchmark but plot real-world use, I don’t think we can draw a conclusion from the FMFI plotted here (other than that it does no harm for my workload).

The comment in the D45046 review about the reduced number of reservations with at least one NOFREE page (= a page which will never be freed) looks good. Having about 20 times fewer reservations with NOFREE pages means 20 times fewer NOFREE pages scattered around in memory. Those NOFREE pages can get in the way of larger allocations. Theoretically more memory areas can be combined (if needed); practically this is not the case yet. There is a slight hint in the measurement in the review comment that there are some more PDE (“Page Directory Entry”) promotions, but they scratch at the 1-2% margin. I do not expect this to result in a noticeable effect on performance.

Nevertheless, this looks very promising. It paves the way for further work, as there are fewer NOFREE pages scattered around. This may make memory defragmentation/compaction techniques more useful. Once those are mature enough to be tested on real-world stuff, I will generate some plots.

Plotting the FreeBSD memory fragmentation

I stumbled upon the work of Bojan Novković regarding physical memory anti-fragmentation mechanisms. As a picture sometimes tells more than 1000 words…

What this is about

I stumbled upon the work of Bojan Novković regarding physical memory anti-fragmentation mechanisms (attention, this is a link to the FreeBSD wiki, content may change without further notice). As a picture sometimes tells more than 1000 words, I wanted to see a graphical representation of the fragmentation. Not in terms of which memory regions are fragmented, but in terms of how fragmented the UMA bucket page allocator freelists are.

Bojan has a fragmentation metric (FMFI) for UMA available as a patch which gives a numeric representation of the fragmentation, but no graphs.

After a bit of tinkering around with gnuplot, I came up with some way of graphing it.

How to create some graphs

First you need some data to plot a graph. Collecting the FMFI stats is easy. A little cron job which runs this periodically is enough:

#!/bin/sh

# Derive a per-boot timestamp from kern.boottime ($5 is the seconds value,
# the sed strips its trailing comma), so each boot gets its own logfile.
boottime=$(sysctl kern.boottime 2>&1 | awk '{print $5}' | sed -e 's:,::')
time=$(date -r ${boottime} +%Y%m%d_%H%M)

logfile=/var/tmp/vm_frag_${time}.log

# Append a timestamped snapshot of the fragmentation index sysctl.
touch ${logfile}
date "+%Y-%m-%d_%H:%M:%S" >> ${logfile}
sysctl vm.phys_frag_idx >> ${logfile}
echo >> ${logfile}

This creates log files in /var/tmp with the formatted boot time in the filename, so that there is an easy indication of a reset of the fragmentation.
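For example (the path and the 10-minute interval are arbitrary choices), assuming the script above is installed as /usr/local/sbin/collect_vm_frag.sh, a crontab entry could look like:

```
# /etc/crontab: take a fragmentation snapshot every 10 minutes
*/10    *       *       *       *       root    /usr/local/sbin/collect_vm_frag.sh
```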

After a while you should have some logs to parse. Gnuplot cannot work with the simple log generated by the cron job, so a CSV needs to be generated. The following awk script (parse_vm_frag.awk) generates the CSV. In my case there is only one NUMA domain, so my awk script to parse the data doesn’t care about NUMA domains.

# Remember the most recent timestamp line; it is prepended to every data row.
/....-..-.._..:..:../ { date = $0 }
# Skip the sysctl name, the NUMA domain headers, the column header and separators.
/vm.phys_frag_idx: / { next }
/DOMAIN/ { next }
/  ORDER (SIZE) |  FMFI/ { next }
/--/ { next }
# Data rows: print "order date fmfi".
/  .. \( .....K\) / { printf "%d %s %d\n", $1, date, $5; next }

The next step is a template (template.gnuplot) for the plots:

set terminal svg dynamic mouse standalone name "%%NAME%%"
# set terminal png size 1920,1280
set output "%%NAME%%.svg"
 
set title '%%NAME%%' noenhanced
set xdata time
set timefmt "%Y-%m-%d_%H:%M:%S"
set xlabel "Date Time"
set zlabel "Memory Fragmentation Index" rotate by 90
set ylabel "freelist size"
set zrange [-1000:1000]
set yrange [0:12]
set ytics 1
# the following rotate doesn't work, at least chrome doesn't rotate the dates on the x-axis at all
set xtics rotate by 90 format "%F %T" timedate 
set xyplane 0
set grid vertical
set border 895
 
splot "%%NAME%%.csv" using 2:1:3 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 1 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 2 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 3 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 4 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 5 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 6 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 7 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 8 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 9 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 10 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 11 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 12 every 13 t '' with filledcurve

Unfortunately I didn’t get the above to work with gnuplot variables within 5 minutes, so I created a little script to generate a plot script for each CSV file.

#!/bin/sh

# For each collected log: convert it to a CSV and instantiate the gnuplot
# template for it.
for log in vm_frag_*.log; do
        base=$(basename ${log} .log)
        awk -f parse_vm_frag.awk <${log} >${base}.csv

        cp template.gnuplot ${base}.gnuplot
        # FreeBSD sed requires an (empty) backup suffix argument with -i.
        sed -i '' -e "s:%%NAME%%:${base}:g" ${base}.gnuplot
done

Now it’s simply “gnuplot *.gnuplot” (assuming the CSV files and the gnuplot files are in the same directory), and you will get SVG graphs.

Some background info

And here are the results of running this for some days on a 2-socket Intel Xeon system (6 cores per socket, plus Hyper-Threading) with 72 GB RAM. This system has about 30 different jails with a diverse mix of nginx, mysql, postgresql, redis, imap, smtp, various java stuff, …, poudriere (3 workers) and buildworld runs (about 30 jails not counting the poudriere runs). So the following graphs are not done in a reproducible way, but are simply the result of real-world applications running all day long. Each new graph means there was a reboot. All reboots were done to update to a more recent FreeBSD-current.

All in all, not only was the application workload always different, but so was the running kernel.

The graphs

Beware! You can not really compare one graph with another. They do not represent the same workload. As such, any conclusion you (or I) want to draw from this is more an indication than a proven fact. Big differences will be visible, small changes may go unnoticed.

This is the graph of FreeBSD-current from around 2024-04-08. There are various modifications compared to a stock FreeBSD system, but the only change in the memory area is the FMFI patch mentioned above.

Explanation of what you see

A memory fragmentation index of 1000 is bad. It means the memory is very fragmented. A value of 0 means there is no fragmentation, and a negative value means it is very easy to satisfy an allocation request.

So bars which go up are bad, bars which go down are good.
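The FMFI patch defines the exact formula; as a rough illustration of how such an index behaves, here is a sketch modeled after the analogous external fragmentation index in Linux (not the FMFI code itself), scaled to the ±1000 range used in these plots:

```shell
#!/bin/sh
# frag_index FREE_PAGES FREE_BLOCKS REQUESTED_PAGES
# Near 1000: plenty of free memory, but too fragmented for the request.
# Zero or negative: the request is easy to satisfy.
frag_index() {
    free_pages=$1; free_blocks=$2; requested=$3
    if [ "${free_blocks}" -eq 0 ]; then
        # No free memory at all is an out-of-memory problem, not fragmentation.
        echo 0
        return
    fi
    echo $(( 1000 - (1000 + (1000 * free_pages) / requested) / free_blocks ))
}

frag_index 1000 4 256     # 1000 free pages in 4 big blocks: easy (negative)
frag_index 1000 1000 256  # the same pages as 1000 single pages: close to 1000
```

The key property the bars visualize: the same amount of free memory scores very differently depending on how it is scattered.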

The page allocator UMA bucket freelist axis (a different color for each size-rank) denotes the allocation size. UMA bucket freelist size 0 is about 4k allocations, and each size increase doubles the allocation size, up to 16M at size-rank 12.
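The doubling described above means the allocation size at a given size-rank is simply 4k shifted left by the rank (a trivial sketch):

```shell
#!/bin/sh
# Allocation size in KiB for a UMA bucket freelist size-rank:
# 4 KiB at rank 0, doubling per rank, up to 16384 KiB (16M) at rank 12.
rank_size_kib() {
    echo $(( 4 << $1 ))
}

for rank in 0 1 3 6 12; do
    printf 'rank %2d -> %5d KiB\n' "${rank}" "$(rank_size_kib "${rank}")"
done
```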

In the above graph all values for UMA bucket freelist size 0 are negative. This means that all allocations of up to 4k were always easy and no fragmentation was noticed. This is not a surprise, given that this is the smallest allocation size.

The fact that UMA bucket freelist size 1 (8k allocations) already had that much fragmentation was a surprise to me at that point. But see the next part for a fix for this.

An immediate fix which prevents some of the fragmentation

The next graph is with a world from around 2024-04-14. It contains Bojan’s commit which prevents a bit of memory fragmentation around kernel stack guard pages.

Here it seems that Bojan’s fix had an immediate effect on bucket freelist size 1 (8k allocation size). It stays in “good shape” for a longer period of time. In this graph we see an improvement at the beginning up to bucket size 6 (256k allocation size). The graphs below even show an improvement over several days of May up to UMA bucket size 3 (32k allocation size).

One of the next things I want to try (and plot) is review D16620, which segregates *_nofree allocations (a small patch), and I’m also interested to see what effect review D40772 has.

Some more graphs

Some more graphs, each one from an updated FreeBSD-current system (the dates in the graphs represent the reboot into the corresponding new world). Chrome was rebuilt by poudriere (it consumes a lot of RAM relative to other packages) several times during those graphs.

Solaris: script to check whether various settings of a system comply with some pre-defined settings

Problem

If you set up a system, you want to make sure that it complies with a pre-defined config. You can do that with a configuration management system, but there are cases where it is useful to do that outside of this context.

Solution

I started writing the shell script below in 2008. Over time (until 2016) it grew into something which is able to output a report of over 1000 items. You can configure it via ${HOME}/.check_host.cfg and /etc/check_host.cfg (it reads both in this order, the first config found wins and the other config is not read). You can use option “-h” to see the usage text. Option “-n” suppresses messages which help to fix issues, “-a” prints out simple HTML instead of text.

Solaris: script to create commands to set up LDOMs based upon output from “ldm ls”

Problem

You have an LDOM which you want to clone to somewhere else, and all you have available to perform that is the ldm command on the target system.

Solution

Download the AWK script below. Use the output of “ldm ls -l -p <ldom>” as the input of this AWK script. The output will be a list of commands to re-create the config for VDS, VDISK, VSW and NETWORK.

I wrote this in 2013, so changes to the output of “ldm ls” since then are not accounted for.