Problem / Story
In 2017 we replaced one storage system with another (actually, we replaced the complete SAN infrastructure). We handled it by attaching both storage systems to VMware (vSphere 5.5) and migrating the datastores. In this process we stumbled upon issues which made some hosts unresponsive in vCenter (while the VMs kept running without issues). Before a host went unresponsive, its performance graphs started to blank out: from the moment the issue appeared until it was resolved, every graph continued to advance but showed no values for that timeframe (left = colorful lines, middle = white space, and after the issue was resolved the colorful lines appeared again). Sometimes the blank-performance-graph issue resolved itself; sometimes the hosts became unresponsive, and vCenter greyed them out and triggered an HA/FT (High Availability / Fault Tolerance) reaction.
Root cause
On the affected hosts we had RDMs (Raw Device Mappings) in use by Microsoft Cluster Service (there is a knowledge-base article about this combination). The issues showed up when we performed SAN operations in VMware, such as the (automatic) rescan for new disks after presenting new disks to VMware. VMware tried to do something clever with the disks (this also happens during the boot of a host, so if you use RDMs and booting a host takes a long time, you are in the situation described here). If only a small number of changes happened at the same time, the issue fixed itself; a large number of changes triggered an HA/FT reaction.
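To see whether a device is already flagged as perennially reserved, you can query it on the ESXi shell. The device ID below is a placeholder, and the sample line mirrors the format of the esxcli output:

```shell
# On the ESXi shell, one device's flags can be listed with (placeholder ID):
#   esxcli storage core device list -d naa.1234567890abcdef12345c42000002a2
# Among the output is a line like the following:
LINE="   Is Perennially Reserved: false"
# The value is the 4th whitespace-separated field:
printf '%s\n' "$LINE" | awk '{print $4}'   # -> false
```

The same field extraction is what the remediation script further down relies on.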
Workaround when the issue shows up
When you see the performance graphs start to show blank space and your VMs are still working, go to the cluster settings and disable vSphere HA (High Availability): cluster -> “Edit Settings” -> “Cluster Features” -> remove the checkmark in front of “Turn On vSphere HA”. Wait until the graphs show values again (for all affected hosts) and then re-enable vSphere HA.
Solution
To prevent this issue from showing up at all, you need to change a setting for the devices on which you have the RDMs: they must be flagged as “perennially reserved”. Here is a little script (small enough to just copy & paste it into a shell on the host) which needs the IDs of the devices used for the RDMs (attention: letters need to be lowercase) in the “RDMS” variable. As we did this on running systems, and each change of the setting caused some action in the background which made the performance-graph issue show up again, there is a “little” sleep between the changes. The amount of sleep depends on your situation: the more RDMs are configured, the bigger it needs to be. We had 15 such devices, and a sleep of 20 minutes between the changes was enough to not trigger an HA/FT reaction. The time needed towards the end was much lower than at the beginning, but as this was more or less a one-off task, this simple version was good enough (it checks whether the setting is already active and does nothing in that case).
For our use case it was also beneficial to set the path selection policy to “fixed”, so this is also included in the script. Your use case may be different.
SLEEPTIME=1200 # 20 minutes per LDEV!
# REPLACE THE FOLLOWING IDs !!! lower case !!!
RDMS="1234567890abcdef12345c42000002a2 1234567890abcdef12345c42000003a3 \
1234567890abcdef12345c42000003a4 1234567890abcdef12345c42000002a5 \
1234567890abcdef12345c42000002a6 1234567890abcdef12345c42000002a7 \
1234567890abcdef12345c42000003a8 1234567890abcdef12345c42000002a9 \
1234567890abcdef12345c42000002aa 1234567890abcdef12345c42000003ab \
1234567890abcdef12345c42000002ac 1234567890abcdef12345c42000003ad \
1234567890abcdef12345c42000002ae 1234567890abcdef12345c42000002af \
1234567890abcdef12345c42000002b0"
for i in $RDMS; do
    LDEV="naa.$i"
    echo "$LDEV"
    # current state of the "Is Perennially Reserved" flag (true/false)
    RESERVED="$(esxcli storage core device list -d "$LDEV" | awk '/Perennially/ {print $4}')"
    if [ "$RESERVED" = "false" ]; then
        echo "  setting perennially reserved to true"
        esxcli storage core device setconfig -d "$LDEV" --perennially-reserved=true
        echo "  sleeping $SLEEPTIME"
        sleep "$SLEEPTIME"
        echo "  setting fixed path"
        esxcli storage nmp device set --device "$LDEV" --psp VMW_PSP_FIXED
    else
        echo "  perennially reserved OK"
    fi
done
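After the script has run, a quick loop over the same RDMS variable can confirm the result for every device. This is a sketch under the assumption that the esxcli output format is as shown; the helper function is our own name, not an esxcli command, and it only parses the text piped into it:

```shell
# Print "<device>: <flag>" from `esxcli storage core device list` output
# piped into it (the helper name is ours, not an esxcli command):
check_reserved() {
    awk -v dev="$1" '/Is Perennially Reserved/ {print dev ": " $4}'
}

# On the host (RDMS as in the script above):
#   for i in $RDMS; do
#       esxcli storage core device list -d "naa.$i" | check_reserved "naa.$i"
#   done

# Illustration with a sample line in the esxcli output format:
printf '   Is Perennially Reserved: true\n' \
    | check_reserved naa.1234567890abcdef12345c42000002a2
# -> naa.1234567890abcdef12345c42000002a2: true
```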