May, 2024 | Alexander Leidinger

What this is about

I stumbled upon the work of Bojan Novković regarding physical memory anti-fragmentation mechanisms (note: this links to the FreeBSD wiki, whose content may change without further notice). As a picture sometimes says more than 1000 words, I wanted to see a graphical representation of the fragmentation. Not in terms of which memory regions are fragmented, but in terms of how fragmented the ~~UMA buckets~~ page allocator freelists are.

Bojan has a fragmentation metric (FMFI) for UMA available as a patch which gives a numeric representation of the fragmentation, but no graphs.

After a bit of tinkering around with gnuplot, I came up with some way of graphing it.

How to create some graphs

First you need some data to plot a graph. Collecting the FMFI stats is easy. A little cron-job which runs this periodically is enough:

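#!/bin/sh
 
# Boot time as epoch seconds (field 5 of the sysctl output, comma stripped)
boottime=$(sysctl kern.boottime 2>&1 | awk '{print $5}' | sed -e 's:,::')
# Formatted boot time, so that each boot gets its own log file
time=$(date -r "${boottime}" +%Y%m%d_%H%M)
 
logfile="/var/tmp/vm_frag_${time}.log"
 
# Append a timestamp, the fragmentation index and a blank separator line
date "+%Y-%m-%d_%H:%M:%S" >> "${logfile}"
sysctl vm.phys_frag_idx >> "${logfile}"
echo >> "${logfile}"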

This creates log files in /var/tmp with the formatted boot time in the filename, so that there is an easy indication of a reset of the fragmentation.
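
To show why the script picks field 5, here is a small sketch that runs the same awk/sed pipeline on a made-up sample line instead of the live sysctl. The function name boottime_from_sysctl and the sample values are assumptions for illustration only.

```sh
#!/bin/sh
# Same awk/sed pipeline as in the cron job, applied to a made-up sample line
# instead of the live sysctl (so it also runs on a non-FreeBSD machine).
boottime_from_sysctl() {
        awk '{print $5}' | sed -e 's:,::'
}

# Hypothetical sample in the format that "sysctl kern.boottime" prints.
sample='kern.boottime: { sec = 1712556000, usec = 123456 } Mon Apr  8 06:00:00 2024'

# Field 5 is "1712556000,"; sed removes the trailing comma.
echo "${sample}" | boottime_from_sysctl
```

The resulting epoch value is what the cron job hands to date -r to build the log file name.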

After a while you should have some logs to parse. Gnuplot cannot work with the plain log generated by the cron job, so a CSV needs to be generated. The following awk script (parse_vm_frag.awk) generates it. In my case there is only one NUMA domain, so the script ignores NUMA domains.

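# Timestamp line written by the cron job: remember it for the following rows
/....-..-.._..:..:../ { date = $0 }
# Skip the sysctl name, the NUMA domain line and the table header
/vm\.phys_frag_idx: / { next }
/DOMAIN/ { next }
 
/  ORDER \(SIZE\) \|  FMFI/ { next }
/--/ { next }
# Data row: print "order timestamp FMFI"
/  .. \( .....K\) / { printf "%d %s %d\n", $1, date, $5; next }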
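
As a rough illustration of what the parser produces, the sketch below feeds a tiny sample log through the same rules. The sample is hypothetical: it is shaped only to match the patterns of the awk script, and the real sysctl output may differ in detail. The header pattern is written here with escaped parentheses and pipe so that it matches the header text literally.

```sh
#!/bin/sh
# Hypothetical sample log, shaped to match the patterns of parse_vm_frag.awk.
sample_log() {
cat <<'EOF'
2024-04-08_12:00:00
vm.phys_frag_idx:
DOMAIN 0:

  ORDER (SIZE) |  FMFI
--
   0 (     4K) |  -250
   1 (     8K) |   120

EOF
}

# The parser rules, inlined so that the sketch is self-contained.
parse_sample() {
        sample_log | awk '
            /....-..-.._..:..:../ { date = $0 }
            /vm\.phys_frag_idx:/ { next }
            /DOMAIN/ { next }
            /  ORDER \(SIZE\) \|  FMFI/ { next }
            /--/ { next }
            /  .. \( .....K\) / { printf "%d %s %d\n", $1, date, $5; next }
        '
}

# Prints one "order timestamp FMFI" line per data row:
#   0 2024-04-08_12:00:00 -250
#   1 2024-04-08_12:00:00 120
parse_sample
```

Each timestamp in the log therefore expands to one CSV line per freelist size, which is what the "every 13" in the plot template relies on.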

Next step is a template (template.gnuplot) for the plots:

set terminal svg dynamic mouse standalone name "%%NAME%%"
# set terminal png size 1920,1280
set output "%%NAME%%.svg"
 
set title '%%NAME%%' noenhanced
set xdata time
set timefmt "%Y-%m-%d_%H:%M:%S"
set xlabel "Date Time"
set zlabel "Memory Fragmentation Index" rotate by 90
set ylabel "freelist size"
set zrange [-1000:1000]
set yrange [0:12]
set ytics 1
# the following rotate doesn't work, at least chrome doesn't rotate the dates on the x-axis at all
set xtics rotate by 90 format "%F %T" timedate 
set xyplane 0
set grid vertical
set border 895
 
splot "%%NAME%%.csv" using 2:1:3 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 1 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 2 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 3 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 4 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 5 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 6 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 7 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 8 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 9 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 10 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 11 every 13 t '' with filledcurve, \
      "" using 2:1:3 skip 12 every 13 t '' with filledcurve

Unfortunately I didn’t get the above to work with gnuplot-variables within 5 minutes, so I created a little script to generate a plot-script for each CSV file.

#!/bin/sh
 
for log in vm_frag_*.log; do
        base=$(basename "${log}" .log)
        # Convert the raw log into the CSV that gnuplot reads
        awk -f parse_vm_frag.awk <"${log}" >"${base}.csv"
 
        # Instantiate the template for this log
        sed -e "s:%%NAME%%:${base}:g" template.gnuplot >"${base}.gnuplot"
done

Now it’s simply “gnuplot *.gnuplot” (assuming the CSV files and the gnuplot files are in the same directory), and you will get SVG graphs.
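
For those who want to try the gnuplot-variable route anyway, here is an unverified sketch. It assumes that a string variable (called base here, a name I made up) is passed with gnuplot's -e option and that the 13 splot clauses are collapsed with gnuplot's "for" iteration. The file plot.gnuplot and both function names are hypothetical, and I have not run this against the real data.

```sh
#!/bin/sh
# Sketch only, not verified against real data: prints a gnuplot fragment that
# expects a string variable "base" (a hypothetical name) set via "gnuplot -e".
emit_splot_fragment() {
        cat <<'EOF'
set output base . ".svg"
set title base noenhanced
splot for [i=0:12] base . ".csv" using 2:1:3 skip i every 13 t '' with filledcurves
EOF
}

# Hypothetical driver: one gnuplot run per CSV, no per-file plot script.
# plot.gnuplot would hold the remaining "set" lines of the template,
# followed by the fragment printed above.
plot_all() {
        for csv in vm_frag_*.csv; do
                gnuplot -e "base='$(basename "${csv}" .csv)'" plot.gnuplot
        done
}

emit_splot_fragment
```

In such a setup the name option of the svg terminal line would have to be dropped or given a fixed string, as it can no longer be filled in by sed.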

Some background info

And here are the results of running this for some days on a 2-socket Intel Xeon system (6 cores per socket plus hyperthreading) with 72 GB RAM. This system has about 30 different jails with a diverse mix of nginx, mysql, postgresql, redis, imap, smtp, various java stuff, …, poudriere (3 workers) and buildworld runs (the count of about 30 jails does not include the poudriere runs). So the following graphs are not done in a reproducible way, but are simply the result of real-world applications running all day long. Each new graph means there was a reboot. All reboots were done to update to a more recent FreeBSD-current.

All in all, not only the application workload was always different, but also the running kernel was different.

The graphs

Beware! You cannot really compare one graph with another. They do not represent the same workload. As such, any conclusion you (or I) want to draw from this is more an indication than a proven fact. Big differences will be visible, small changes may go unnoticed.

This is the graph of FreeBSD-current from around 2024-04-08. There are various modifications compared to a stock FreeBSD system, but the only change in the memory area is the FMFI patch mentioned above.

Explanation of what you see

A memory fragmentation index of 1000 is bad. It means the memory is very fragmented. A value of 0 means there is no fragmentation, and a negative value means it is very easy to satisfy an allocation request.

So bars which go up are bad, bars which go down are good.

The page allocator ~~UMA bucket~~ freelists axis (different colors for each size-rank) denotes the allocation size. ~~UMA bucket~~ Freelist size 0 is about 4k allocations, and each size-increase doubles the allocation size up to 16M at size-rank 12.
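
The doubling can be written down as a small sketch, assuming a base page size of exactly 4k (the text above only says "about 4k"). The function name order_size_kb is made up for this example.

```sh
#!/bin/sh
# Allocation size in KiB for a freelist size-rank, assuming 4k base pages:
# every rank doubles the size, so rank n is 4k shifted left by n bits.
order_size_kb() {
        echo $((4 << $1))
}

# Print the size for every size-rank from 0 to 12.
order=0
while [ "${order}" -le 12 ]; do
        printf '%2d %6dK\n' "${order}" "$(order_size_kb "${order}")"
        order=$((order + 1))
done
```

Size-rank 6 comes out as 256K and size-rank 12 as 16384K, which is the 16M mentioned above.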

In the above graph, all values for ~~UMA bucket~~ freelist size 0 are negative. This means that allocations of up to 4k were always easy and no fragmentation was noticed. This is not a surprise, given that this is the smallest allocation size.

The fact that ~~UMA bucket~~ freelist size 1 (8k allocations) already had that much fragmentation was a surprise to me at that point. But see the next part for a fix for this.

An immediate fix which prevents some of the fragmentation

The next graph is with a world from around 2024-04-14. It contains Bojan's commit which prevents a bit of memory fragmentation around kernel stack guard pages.

Here it seems that Bojan's fix had an immediate effect on ~~bucket~~ freelist size 1 (8k allocation size). It stays in “good shape” for a longer period of time. In this graph we see an improvement at the beginning up to ~~bucket~~ size 6 (256k allocation size). The graphs below even show an improvement over several days of May up to ~~UMA bucket~~ size 3 (32k allocation size).

One of the next things I want to try (and plot) is review D16620 which segregates *_nofree allocations per allocation (a small patch) and I’m also interested to see what effect review D40772 has.

Some more graphs

Some more graphs, each one from an updated FreeBSD-current system (dates in the graph represent the reboot into the corresponding new world). Chrome was rebuilt by poudriere (which consumes a lot of RAM relative to other packages) several times during the period covered by those graphs.