FreeBSD Service Jails – another layer in the security onion

In May I committed a new feature to FreeBSD-current (it will be in the FreeBSD 15 release, I have no plans to merge this to 14). This feature is called “Service Jails”. When you enable it, it takes a service (something which is started by an rc-script at boot or by hand via service(8)) and starts it in a jail(8). It can do this with any service, and with no more than 2 lines of configuration.

For those who don’t know, a jail is a kind of container technology. We have had this technology since 1999 (so it pre-dates Docker by 14 years). It served as an inspiration for Solaris zones.

Too good to be true?

Containerizing some software with only 2 lines of configuration (if a service is “Service Jail ready”, only one line) sounds amazing, and is in no way comparable to a Dockerfile or a normal jail config. This sounds a bit too good to be true. And that is correct. The Service Jails framework is somewhere between a fully isolated container and no containerization at all.

The biggest difference is that a Service Jail has full access to the entire filesystem of the host (or parent jail), except for chflags(1). This means if your service runs as root in the Service Jail and is compromised, the attacker is able to read your password database and modify nearly any file content (and as such, on the next boot anything can happen). The only exception is a service which already has provisions to run in a chroot and has this enabled; in that case only files inside the chroot can be modified by an attacker.

What are the benefits?

Compared to running a service on the host itself, without putting it into a jail you have created yourself and tailored to only the software you need, you get the benefit of limiting what the software (or an intruder) is able to do, but not the benefit of a minimal software install.

When you enable a Service Jail for a particular service which is not Service Jail ready and you do not provide a Service Jail config, the service is started inside a jail without any network access but with full access to the filesystem. A Service Jail ready service which normally needs network access can be limited by a custom config to not have any network access at all (or only IPv6 and not IPv4, or vice versa). This means you can prevent this software from accessing the network despite not having it inside a VM.

A Service Jail also doesn’t allow the service to:

  • mount filesystems (and on purpose there is no provision so far to optionally allow this),
  • open raw sockets (can be enabled),
  • open sockets of protocol stacks that have not had jail functionality added to them (IPv4, IPv6, local UNIX sockets and routing stuff are jail-aware) (can be enabled),
  • lock/unlock physical pages in memory (can be enabled),
  • use System V IPC facilities (can be enabled),
  • use debugging facilities for unprivileged processes,
  • see processes from the host or other jails,

and all the other stuff which is prohibited in jails by default.

When you enable network access (IPv4 and/or IPv6) for a particular service, the Service Jail inherits all the IPs of the host (or parent jail). This means you can not run two services which want to listen on the same port this way. But as you can limit a service to only IPv4 and not IPv6 (and vice versa), you could run two different services for IPv4 and IPv6, or you can test scenarios where only one IP stack is available while the host itself is configured for dual-stack network access.
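As a hedged sketch of such a per-service network restriction (the service names are placeholders, and the exact option keywords should be checked against the rc.conf(5) man page under “svcj”), splitting two daemons across the IP stacks could look like this in /etc/rc.conf:

```
# Hypothetical example: one daemon serves IPv4 only, the other IPv6 only,
# so both can listen on the same port on a dual-stack host.
mydaemon4_svcj="YES"
mydaemon4_svcj_options="netv4"
mydaemon6_svcj="YES"
mydaemon6_svcj_options="netv6"
```

Leaving out the `_svcj_options` line entirely gives the default of no network access at all.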

How to get the most out of Service Jails?

With the possibility to allow unprivileged users to open privileged ports (sysctl net.inet.ip.portrange.reservedhigh=0) and having the service started as non-root (sysrc servicename_user=MyServiceUser), a Service Jail provides a very good benefit for a simple one-line config change (sysrc servicename_svcj=YES).
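Put together as a sketch (the service and user names are placeholders; these are FreeBSD-specific configuration commands, not portable shell), the three changes just mentioned amount to:

```
# Allow unprivileged users to bind to privileged ports
# (add to /etc/sysctl.conf to make it persistent across reboots).
sysctl net.inet.ip.portrange.reservedhigh=0

# Run the service as a dedicated non-root user, inside a Service Jail.
sysrc mydaemon_user=MyServiceUser
sysrc mydaemon_svcj=YES

service mydaemon restart
```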

In this case filesystem access is restricted to what this particular user is able to read/write, only processes started by the service are visible to the service, and all the other jail restrictions apply. An intruder may as such do bad things to this particular service, but not to other services on the system.

For a read-only webserver this may mean an attacker may be able to modify some log files, but can not see other processes running on the system and deduce from them the most valuable next step in the attack.

For a read-only php-fpm service it may mean that the attacker can run some in-memory code to spawn a botnet, but not compromise other parts of the host or access System V memory locations of a database (if the php-fpm service is not configured to allow access to System V resources).

Further reading

The rc.conf(5) man page contains more info about what can be enabled for Service Jails (search for “svcj” and “SERVICE JAILS”). The rc scripting article explains how to make a service Service Jails ready, and the FreeBSD handbook contains a section on how to enable and configure Service Jails.

What’s next?

The base system services are either made Service Jails aware, or configured to not run inside a Service Jail (e.g. running fsck doesn’t make sense in a jail). Not all of the services are tested with Service Jails. Give them a try and send a bug report in case something doesn’t work.

The FreeBSD ports collection has about 1500 services. I have already committed some patches or sent patches to the maintainers for some of the high-profile ports (like webservers, databases, DNS servers, …) to make some of them Service Jails ready, but there are too many services to do them all myself. Feel free to submit some patches for them.

Plotting the FreeBSD memory fragmentation – part 2

If you haven’t read part 1 already, please do so; otherwise you will not understand what this is about (I don’t repeat the basics here).

The following graphs show the FMFI with D45043, D45045 and D45046 applied.

When you look at the graphs, keep in mind that I updated FreeBSD on 2024-05-27-120546 and 2024-06-04-105830. None of those updates introduced changes in the memory allocation area, so the results should be somewhat comparable.

I used the same workloads as in part 1 (not a deterministic benchmark, but a real-world use case with 30 jails and various package build runs).

First, the second-to-last graph from part 1, to have something to compare against:

Now with the 3 changes listed above:

Just by looking at the graphs, and given that I don’t run a fixed benchmark but plot real-world use, I don’t think we can draw a conclusion from the FMFI plotted here (other than that it does no harm for my workload).

The comment in the D45046 review about the reduced number of reservations with at least one NOFREE page (= a page which will never be freed) looks good. Having about 20 times fewer reservations with NOFREE pages means 20 times fewer NOFREE pages scattered around in memory. Those NOFREE pages can get in the way of larger allocations. Theoretically more memory areas can be combined (if needed); practically this is not the case yet. There is a slight hint in the measurement in the review comment that there are some more PDE (“Page Directory Entry”) promotions, but they scratch at the 1-2% margin. I do not expect this to result in a noticeable effect on performance.

Nevertheless, this looks very promising. It paves the way for further work, as there are fewer NOFREE pages scattered around. This may make memory defragmentation/compaction techniques more useful. Once those are mature enough to be tested on real-world stuff, I will generate some plots.

Plotting the FreeBSD memory fragmentation

I stumbled upon the work of Bojan Novković regarding physical memory anti-fragmentation mechanisms. As a picture sometimes tells more than 1000 words…

What this is about

I stumbled upon the work of Bojan Novković regarding physical memory anti-fragmentation mechanisms (attention, this is a link to the FreeBSD wiki, content may change without further notice). As a picture sometimes tells more than 1000 words, I wanted to see a graphical representation of the fragmentation. Not in terms of which memory regions are fragmented, but in terms of how fragmented the UMA bucket page allocator freelists are.

Bojan has a fragmentation metric (FMFI) for UMA available as a patch which gives a numeric representation of the fragmentation, but no graphs.

After a bit of tinkering around with gnuplot, I came up with some way of graphing it.

How to create some graphs

First you need some data to plot a graph. Collecting the FMFI stats is easy. A little cron job which runs this periodically is enough:

#!/bin/sh

# Derive a per-boot timestamp from kern.boottime ($5 is the seconds value,
# the sed strips its trailing comma), so each boot gets its own logfile.
boottime=$(sysctl kern.boottime 2>&1 | awk '{print $5}' | sed -e 's:,::')
time=$(date -r ${boottime} +%Y%m%d_%H%M)

logfile=/var/tmp/vm_frag_${time}.log

# Append a timestamped snapshot of the fragmentation index sysctl.
touch ${logfile}
date "+%Y-%m-%d_%H:%M:%S" >> ${logfile}
sysctl vm.phys_frag_idx >> ${logfile}
echo >> ${logfile}

This creates log files in /var/tmp with the formatted boot time in the filename, so that there is an easy indication of a reset of the fragmentation.
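For example (the path and the 10-minute interval are arbitrary choices), assuming the script above is installed as /usr/local/sbin/collect_vm_frag.sh, a crontab entry could look like:

```
# /etc/crontab: take a fragmentation snapshot every 10 minutes
*/10    *       *       *       *       root    /usr/local/sbin/collect_vm_frag.sh
```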

After a while you should have some logs to parse. Gnuplot cannot work with the simple log generated by the cron job, so a CSV needs to be generated. The following awk script (parse_vm_frag.awk) generates the CSV. In my case there is only one NUMA domain, so my awk script to parse the data doesn’t care about NUMA domains.

# Remember the most recent timestamp line; it is prepended to every data row.
/....-..-.._..:..:../ { date = $0 }
# Skip the sysctl name, the NUMA domain headers, the column header and separators.
/vm.phys_frag_idx: / { next }
/DOMAIN/ { next }
/  ORDER (SIZE) |  FMFI/ { next }
/--/ { next }
# Data rows: print "order date fmfi".
/  .. \( .....K\) / { printf "%d %s %d\n", $1, date, $5; next }

The next step is a template (template.gnuplot) for the plots:

set terminal svg dynamic mouse standalone name "%%NAME%%"
# set terminal png size 1920,1280
set output "%%NAME%%.svg"
 
set title '%%NAME%%' noenhanced
set xdata time
set timefmt "%Y-%m-%d_%H:%M:%S"
set xlabel "Date Time"
set zlabel "Memory Fragmentation Index" rotate by 90
set ylabel "freelist size"
set zrange [-1000:1000]
set yrange [0:12]
set ytics 1
# the following rotate doesn't work, at least chrome doesn't rotate the dates on the x-axis at all
set xtics rotate by 90 format "%F %T" timedate 
set xyplane 0
set grid vertical
set border 895
 
splot "%%NAME%%.csv" using 2:1:3 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 1 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 2 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 3 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 4 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 5 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 6 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 7 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 8 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 9 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 10 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 11 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 12 every 13 t '' with filledcurve

Unfortunately I didn’t get the above to work with gnuplot variables within 5 minutes, so I created a little script to generate a plot script for each CSV file.

#!/bin/sh

# For each collected log: convert it to a CSV and instantiate the gnuplot
# template for it.
for log in vm_frag_*.log; do
        base=$(basename ${log} .log)
        awk -f parse_vm_frag.awk <${log} >${base}.csv

        cp template.gnuplot ${base}.gnuplot
        # FreeBSD sed requires an (empty) backup suffix argument with -i.
        sed -i '' -e "s:%%NAME%%:${base}:g" ${base}.gnuplot
done

Now it’s simply “gnuplot *.gnuplot” (assuming the CSV files and the gnuplot files are in the same directory), and you will get SVG graphs.

Some background info

And here are the results of running this for some days on a 2-socket Intel Xeon system (6 cores per socket, plus Hyper-Threading) with 72 GB RAM. This system has about 30 different jails with a diverse mix of nginx, mysql, postgresql, redis, imap, smtp, various java stuff, …, poudriere (3 workers) and buildworld runs (about 30 jails not counting the poudriere runs). So the following graphs are not done in a reproducible way, but are simply the result of real-world applications running all day long. Each new graph means there was a reboot. All reboots were done to update to a more recent FreeBSD-current.

All in all, not only was the application workload always different, but so was the running kernel.

The graphs

Beware! You can not really compare one graph with another. They do not represent the same workload. As such, any conclusion you (or I) want to draw from this is more an indication than a proven fact. Big differences will be visible, small changes may go unnoticed.

This is the graph of FreeBSD-current from around 2024-04-08. There are various modifications compared to a stock FreeBSD system, but the only change in the memory area is the FMFI patch mentioned above.

Explanation of what you see

A memory fragmentation index of 1000 is bad. It means the memory is very fragmented. A value of 0 means there is no fragmentation, and a negative value means it is very easy to satisfy an allocation request.

So bars which go up are bad, bars which go down are good.
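The FMFI patch defines the exact formula; as a rough illustration of how such an index behaves, here is a sketch modeled after the analogous external fragmentation index in Linux (not the FMFI code itself), scaled to the ±1000 range used in these plots:

```shell
#!/bin/sh
# frag_index FREE_PAGES FREE_BLOCKS REQUESTED_PAGES
# Near 1000: plenty of free memory, but too fragmented for the request.
# Zero or negative: the request is easy to satisfy.
frag_index() {
    free_pages=$1; free_blocks=$2; requested=$3
    if [ "${free_blocks}" -eq 0 ]; then
        # No free memory at all is an out-of-memory problem, not fragmentation.
        echo 0
        return
    fi
    echo $(( 1000 - (1000 + (1000 * free_pages) / requested) / free_blocks ))
}

frag_index 1000 4 256     # 1000 free pages in 4 big blocks: easy (negative)
frag_index 1000 1000 256  # the same pages as 1000 single pages: close to 1000
```

The key property the bars visualize: the same amount of free memory scores very differently depending on how it is scattered.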

The page allocator UMA bucket freelist axis (a different color for each size-rank) denotes the allocation size. UMA bucket freelist size 0 is about 4k allocations, and each size increase doubles the allocation size, up to 16M at size-rank 12.
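The doubling described above means the allocation size at a given size-rank is simply 4k shifted left by the rank (a trivial sketch):

```shell
#!/bin/sh
# Allocation size in KiB for a UMA bucket freelist size-rank:
# 4 KiB at rank 0, doubling per rank, up to 16384 KiB (16M) at rank 12.
rank_size_kib() {
    echo $(( 4 << $1 ))
}

for rank in 0 1 3 6 12; do
    printf 'rank %2d -> %5d KiB\n' "${rank}" "$(rank_size_kib "${rank}")"
done
```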

In the above graph all values for UMA bucket freelist size 0 are negative. This means that all allocations of up to 4k were always easy and no fragmentation was noticed. This is not a surprise, given that this is the smallest allocation size.

The fact that UMA bucket freelist size 1 (8k allocations) already had that much fragmentation was a surprise to me at that point. But see the next part for a fix for this.

An immediate fix which prevents some of the fragmentation

The next graph is with a world from around 2024-04-14. It contains Bojan’s commit which prevents a bit of memory fragmentation around kernel stack guard pages.

Here it seems that Bojan’s fix had an immediate effect on bucket freelist size 1 (8k allocation size). It stays in “good shape” for a longer period of time. In this graph we see an improvement at the beginning up to bucket size 6 (256k allocation size). The graphs below even show an improvement over several days of May up to UMA bucket size 3 (32k allocation size).

One of the next things I want to try (and plot) is review D16620, which segregates *_nofree allocations (a small patch), and I’m also interested to see what effect review D40772 has.

Some more graphs

Some more graphs, each one from an updated FreeBSD-current system (the dates in the graphs represent the reboot into the corresponding new world). Chrome was rebuilt by poudriere (it consumes a lot of RAM relative to other packages) several times during those graphs.

Solaris: script to check whether various settings of a system comply with some pre-defined settings

Problem

If you set up a system, you want to make sure that it complies with a pre-defined config. You can do that with a configuration management system, but there are cases where it is useful to do that outside of this context.

Solution

I started writing the shell script below in 2008. Over time (until 2016) it grew into something which is able to output a report of over 1000 items. You can configure it via ${HOME}/.check_host.cfg and /etc/check_host.cfg (it reads both in this order, the first config found wins and the other config is not read). You can use option “-h” to see the usage text. Option “-n” suppresses messages which help to fix issues, “-a” prints out simple HTML instead of text.

Solaris: script to create commands to set up LDOMs based upon output from “ldm ls”

Problem

You have an LDOM which you want to clone to somewhere else, and all you have available to perform that is the ldm command on the target system.

Solution

Download the AWK script below. Use the output of “ldm ls -l -p <ldom>” as the input of this AWK script. The output will be a list of commands to re-create the config for VDS, VDISK, VSW and NETWORK.

I wrote this in 2013, so changes to the output of “ldm ls” since then are not accounted for.