Complete network loss on Solaris 10u10 CPU 2012-10 on virtualized T4-2

The problem I see at work: a T4-2 with 3 guest LDOMs and virtualized disks and networks lost its complete network connectivity “out of the blue” once, and maybe loses connectivity sporadically directly after a cold boot. After a lot of discussion with Oracle, I have the impression that we have two separate problems here.

1st problem:
Total network loss of the machine (no zone, no guest LDOM, and not even the primary LDOM was able to receive or send IP packets). This happened once, and we have no idea how to reproduce it. In the logs we see the message “[ID 920994 kern.warning] WARNING: vnetX: exceeded number of permitted handshake attempts (5) on channel xxx”. According to Oracle this is supposed to be fixed in 148677-01, which will come with Solaris 10u11. They suggested using a vsw interface instead of a vnet interface on the primary domain to at least lower the probability of this problem hitting us. They were not able to tell us how to reproduce the problem (it seems to be a race condition; at least that is my impression from the description given by the Oracle engineer handling the SR). Only a reboot solved the problem. I was told we are the only customer who has reported this kind of problem; the patch is based on an internal bug report from internal tests.
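To check other machines for the same symptom, the warning can be searched for in the system log. The following is only a minimal sketch: the regular expression is built from the message text quoted above, and the function names (`find_handshake_warnings`, `count_by_vnet`) as well as the assumption that the rest of the log line can be ignored are mine, not something from Oracle.

```python
import re
from collections import Counter

# Pattern derived from the warning text quoted in the post. The format of
# the channel identifier is not known here, so any non-blank token is accepted.
HANDSHAKE_RE = re.compile(
    r"WARNING: (vnet\d+): exceeded number of permitted handshake attempts "
    r"\((\d+)\) on channel (\S+)"
)


def find_handshake_warnings(lines):
    """Return one (vnet, attempts, channel) tuple per matching log line."""
    hits = []
    for line in lines:
        match = HANDSHAKE_RE.search(line)
        if match:
            hits.append((match.group(1), int(match.group(2)), match.group(3)))
    return hits


def count_by_vnet(hits):
    """Count how often each vnet instance appears in the hits."""
    return Counter(vnet for vnet, _attempts, _channel in hits)
```

A possible use is to pass an open log file (on Solaris 10 typically `/var/adm/messages`) to `find_handshake_warnings` and print the result of `count_by_vnet`, which shows at a glance which vnet instances were affected.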

2nd problem:
After cold boots, sometimes some machines (not all) are not able to connect to an IP on the T4. A reboot helps, as does removing an interface from an aggregate and directly adding it again (see below for the system config). To try to reproduce the problem, we did a lot of warm reboots of the primary domain, and the problem never showed up. We did some cold reboots, and the problem showed up once.

In case someone else sees one of those problems on their machines too, please get in contact with me, so that we can see what we have in common, try to track this down further, and share info which may help in reproducing the problems.

System setup:

  • T4-2 with 4 HBAs and 8 NICs (4 * igb on-board, 4 * nxge on additional network card)
  • 3 guest LDOMs and one io+control domain (both in the primary domain)
  • the guest LDOMs use SAN disks over the 4 HBAs
  • the primary domain uses a mirrored zpool on SSDs
  • 5 vswitches in the hypervisor
  • 4 aggregates (aggr1 – aggr4 with L2-policy), each one with one igb and one nxge NIC
  • each aggregate is connected to a separate vswitch (the 5th vswitch is for machine-internal communication)
  • each guest LDOM has three vnets, each vnet connected to a vswitch (1 guest LDOM has aggr1+2 only for zones (via vnets), 2 guest LDOMs have aggr3+4 only for zones (via vnets), and all LDOMs have aggr2+3 (via vnets) for global-zone communication; all LDOMs are additionally connected to the machine-internal-only vswitch via the 3rd vnet)
  • the primary domain uses 2 vnets connected to the vswitches which are connected to aggr2 and aggr3 (for consistency with the other LDOMs on this machine) and has no zones
  • this means each entity (primary domain, guest LDOMs and each zone) has two vnets, and those two vnets are configured in a link-based IPMP setup (vnet-linkprop=phys-state)
  • each vnet has VLAN tagging configured in the hypervisor (with the zones being in different VLANs than the LDOMs)

The change proposed by Oracle is to replace the 2 vnet interfaces in the primary domain with 2 vsw interfaces (which means doing the VLAN tagging directly in the primary domain instead of in the vnet config). For IPMP to keep working, this requires vsw-linkprop=phys-state. We have two systems with the same setup; on one of them we have already made this change, and it is working as before. As we don’t know how to reproduce the 1st problem, we don’t know whether the problem is fixed, or what the probability is of getting hit by it again.

Ideas / suggestions / info welcome.

  1. I’m hitting this bug as we speak on some of my Sol10 and Sol11 LDoms running on Oracle VM 3.1.1.

    You might want to check this topic:

    Virtual Network LDC Handshake Issues Seen When There Are a Large Number of Virtual Network Devices Present



