OpenPGP crypto cards ordered

I wrote in a previous blog post that I want to switch to crypto cards for use with ssh and GnuPG. After some research I settled on the OpenPGP crypto cards and ordered them from kernelconcepts. As soon as they arrive (and I have some free time), I will start using them and write down how to work with them on FreeBSD.

Complete network loss on Solaris 10u10 CPU 2012-10 on virtualized T4-2

The problem I see at work: a T4-2 with 3 guest LDOMs, virtualized disks and virtualized networks lost all network connectivity “out of the blue” once, and possibly loses it sporadically directly after a cold boot. After a lot of discussion with Oracle, I have the impression that we have two separate problems here.

1st problem:
Total network loss of the machine (no zone, no guest LDOM and not even the primary LDOM was able to receive or send IP packets). This happened once, and I have no idea how to reproduce it. In the logs we see the message “[ID 920994 kern.warning] WARNING: vnetX: exceeded number of permitted handshake attempts (5) on channel xxx”. According to Oracle this is supposed to be fixed in patch 148677-01, which will come with Solaris 10u11. They suggested using a vsw interface instead of a vnet interface in the primary domain, to at least lower the probability of this problem hitting us. They were not able to tell us how to reproduce the problem (it seems to be a race condition, at least that is my impression from the description given by the Oracle engineer handling the SR). Only a reboot resolved the problem. I was told we are the only customer who has reported this kind of problem; the patch is based on an internal bug report from internal tests.
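To check whether a machine has logged this warning before, the messages file can be scanned for it. The following is only a minimal sketch: the regular expression is built from the warning text quoted above, while the function name and the assumption that the rest of each syslog line can be ignored are mine.

```python
import re
from collections import Counter

# Pattern derived from the warning text quoted in this post. The layout of
# the rest of the syslog line (timestamp, host, ID) is not relied upon.
HANDSHAKE_RE = re.compile(
    r"WARNING: (?P<dev>vnet\d+): exceeded number of permitted "
    r"handshake attempts \((?P<attempts>\d+)\) on channel (?P<channel>\S+)"
)


def count_handshake_failures(lines):
    """Count handshake warnings per (vnet device, channel) pair.

    `lines` is any iterable of log lines, e.g. an open file object.
    The function name is hypothetical, chosen for this sketch.
    """
    hits = Counter()
    for line in lines:
        match = HANDSHAKE_RE.search(line)
        if match:
            hits[(match.group("dev"), match.group("channel"))] += 1
    return hits


# Usage sketch (on Solaris the system log is /var/adm/messages):
#   with open("/var/adm/messages", errors="replace") as fh:
#       for (dev, channel), n in sorted(count_handshake_failures(fh).items()):
#           print(dev, channel, n)
```

Counting per device and channel instead of just grepping for the string makes it easier to compare affected machines and see whether the same vnet is involved each time.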

2nd problem:
After a cold boot, some machines (not all) are sometimes unable to connect to an IP address on the T4. A reboot helps, as does removing an interface from an aggregate and directly adding it again (see below for the system config). To try to reproduce the problem, we did a lot of warm reboots of the primary domain, and the problem never showed up. We did some cold reboots, and the problem showed up once.

If someone else sees one of these problems on their machines too, please get in contact with me, so that we can see what we have in common, try to track this down further, and share information which may help to reproduce the problems.

System setup:

  • T4-2 with 4 HBAs and 8 NICs (4 * igb on-board, 4 * nxge on an additional network card)
  • 3 guest LDOMs and one io+control domain (both in the primary domain)
  • the guest LDOMs use SAN disks over the 4 HBAs
  • the primary domain uses a mirrored zpool on SSDs
  • 5 vswitches in the hypervisor
  • 4 aggregates (aggr1 to aggr4, with L2 policy), each one with one igb and one nxge NIC
  • each aggregate is connected to a separate vswitch (the 5th vswitch is for machine-internal communication)
  • each guest LDOM has three vnets, each vnet connected to a vswitch (1 guest LDOM has aggr1+2 only for zones (via vnets), 2 guest LDOMs have aggr3+4 only for zones (via vnets), and all LDOMs have aggr2+3 (via vnets) for global-zone communication; all LDOMs are additionally connected to the machine-internal-only vswitch via the 3rd vnet)
  • the primary domain uses 2 vnets connected to the vswitches which are connected to aggr2 and aggr3 (for consistency with the other LDOMs on this machine) and has no zones
  • this means each entity (primary domain, guest LDOMs and each zone) has two vnets, and those two vnets are configured in a link-based IPMP setup (vnet-linkprop=phys-state)
  • each vnet has VLAN tagging configured in the hypervisor (with the zones being in different VLANs than the LDOMs)

The change proposed by Oracle is to replace the 2 vnet interfaces in the primary domain with 2 vsw interfaces (which means doing the VLAN tagging directly in the primary domain instead of in the vnet config). For IPMP to keep working, this requires vsw-linkprop=phys-state. We have two systems with the same setup; on one of them we already made this change, and it is working as before. As we don't know how to reproduce the 1st problem, we don't know whether the problem is fixed, or how likely we are to be hit by it again.

Ideas / suggestions / info welcome.