Complete network loss on Solaris 10u10 CPU 2012-10 on virtualized T4-2

The problem I see at work: a T4-2 with 3 guest LDOMs, virtualized disks and networks, lost complete network connectivity “out of the blue” once, and sporadically directly after a cold boot. After a lot of discussion with Oracle, I have the impression that we have two problems here.

1st problem:
Total network loss of the machine (no zone, guest LDOM, or the primary LDOM was able to receive or send IP packets). This happened once. No idea how to reproduce it. In the logs we see the message “[ID 920994 kern.warning] WARNING: vnetX: exceeded number of permitted handshake attempts (5) on channel xxx”. According to Oracle this is supposed to be fixed in patch 148677-01, which will come with Solaris 10u11. They suggested using a vsw interface instead of a vnet interface on the primary domain to at least lower the probability of this problem hitting us. They were not able to tell us how to reproduce the problem (it seems to be a race condition, at least that is the impression I get from the description of the Oracle engineer handling the SR). Only a reboot solved the problem. I was told we are the only client which reported this kind of problem; the patch is based upon an internal bug report from internal tests.
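If you want to check whether one of your systems has logged the same message, a minimal check (assuming the default Solaris log location) would be:

    # Look for the vnet handshake warning on the primary and guest domains;
    # the vnet instance and channel number vary per system.
    grep "exceeded number of permitted handshake attempts" /var/adm/messages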

2nd problem:
After cold boots, sometimes some machines (not all) are not able to connect to an IP on the T4. A reboot helps, as does removing an interface from an aggregate and directly adding it again (a sketch of this follows; see below for the system config). To try to reproduce the problem, we did a lot of warm reboots of the primary domain, and the problem never showed up. We did some cold reboots, and the problem showed up once.
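For reference, the workaround amounts to something like the following (Solaris 10 dladm syntax; the NIC name and the aggregation key are placeholders, assuming aggr1 was created with key 1):

    # Drop one NIC from the aggregation and add it right back.
    dladm remove-aggr -d nxge0 1
    dladm add-aggr -d nxge0 1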

In case someone else sees one of those problems on their machines too, please get in contact with me so we can see what we have in common, to try to track this down further and to share info which may help in reproducing the problems.

System setup:

  • T4-2 with 4 HBAs and 8 NICs (4 * igb on-board, 4 * nxge on an additional network card)
  • 3 guest LDOMs and one io+control domain (both roles in the primary domain)
  • the guest LDOMs use SAN disks over the 4 HBAs
  • the primary domain uses a mirrored zpool on SSDs
  • 5 vswitches in the hypervisor
  • 4 aggregates (aggr1-aggr4 with L2 policy), each one with one igb and one nxge NIC (see the sketch after this list)
  • each aggregate is connected to a separate vswitch (the 5th vswitch is for machine-internal communication)
  • each guest LDOM has three vnets, each vnet connected to a vswitch (1 guest LDOM has aggr1+2 only for zones (via vnets), 2 guest LDOMs have aggr3+4 only for zones (via vnets), and all LDOMs have aggr2+3 (via vnets) for global-zone communication; all LDOMs are additionally connected to the machine-internal-only vswitch via the 3rd vnet)
  • the primary domain uses 2 vnets connected to the vswitches which are connected to aggr2 and aggr3 (consistency with the other LDOMs on this machine) and has no zones
  • this means each entity (primary domain, guest LDOMs and each zone) has two vnets, and those two vnets are configured in a link-based IPMP setup (vnet-linkprop=phys-state); see the sketch after this list
  • each vnet has VLAN tagging configured in the hypervisor (with the zones being in different VLANs than the LDOMs)
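To make the setup above concrete, here is a minimal sketch of how one aggregate and one link-based IPMP group would be configured (Solaris 10 syntax; the NIC names, guest name, IP address, and aggregation key are placeholders, not our real values):

    # One aggregate with L2 policy, one igb plus one nxge port (key 2 -> aggr2).
    dladm create-aggr -P L2 -d igb1 -d nxge1 2

    # Let the vnets report the physical link state, which link-based IPMP needs.
    ldm set-vnet linkprop=phys-state vnet0 someguest
    ldm set-vnet linkprop=phys-state vnet1 someguest

    # Inside the guest: link-based IPMP, one data address, no test addresses.
    echo "192.0.2.11 netmask + broadcast + group ipmp0 up" > /etc/hostname.vnet0
    echo "group ipmp0 up" > /etc/hostname.vnet1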

The change proposed by Oracle is to replace the 2 vnet interfaces in the primary domain with 2 vsw interfaces (which means doing the VLAN tagging in the primary domain directly instead of in the vnet config). To have IPMP working, this means setting vsw-linkprop=phys-state. We have two systems with the same setup; on one system we already made this change and it is working as before. As we do not know how to reproduce the 1st problem, we do not know whether the problem is fixed, or what the probability is of being hit by it again.
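A rough sketch of what this change looks like (the vsw names, the VLAN ID 123, and the instance numbers are placeholders; in Solaris 10 the VLAN tag is encoded in the PPA of the plumbed interface as VLAN*1000+instance):

    # Let the vsw devices report the physical link state so IPMP keeps working.
    ldm set-vsw linkprop=phys-state primary-vsw2
    ldm set-vsw linkprop=phys-state primary-vsw3

    # Plumb the vsw devices directly in the primary domain, with the VLAN
    # tag in the PPA, e.g. VLAN 123 on vsw2 and vsw3:
    ifconfig vsw123002 plumb
    ifconfig vsw123003 plumb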

Ideas / suggestions / info welcome.

Incompatible WP plugins

The geosmart plugin is incompatible with the one-time-password (OTP) plugin of WordPress. The problem is that the OTP plugin does not display the challenge on the login page anymore when the geosmart plugin is activated.

A workaround may be to make sure the geosmart plugin does nothing on the login page, but this incompatibility could also cause problems somewhere else.

The problem could be related to the way the geosmart plugin uses jQuery. I found a bug report for OTP where the problem was the jQuery handling in another plugin. The specific problem mentioned there does not seem to be the same as in the geosmart plugin, at least from the very quick look I had.

So… for now I have disabled the geosmart plugin. Until then I had to guess the OTP sequence number; most of the time I guessed right, but sometimes I did not.

Email app from Android 3.1 in Android 3.2?

As previously reported, I tried the update to Android 3.2 on my Tab and was not happy about the new Email app. At the weekend I had a little bit of time, so I tried to get the Email.apk from Android 3.1 into Android 3.2.

Long story short, I failed.

TitaniumBackup PRO was restoring for hours (the option to migrate from a different ROM version was enabled) until I killed the app, and it did not get anywhere (I just emailed their support asking whether I did something completely stupid or whether this is a bug in TB). And a copy by hand into /system/app did not work (the app fails to start).

Ideas welcome.

Strange performance problem with the IBM HTTP Server (a modified Apache)

Recently we had a strange performance problem at work. A web application was having slow response times from time to time and users complained. We did not see unusual CPU/mem/swap usage on any involved machine. I generated heat-maps from performance measurements and there were no obvious traces of slow behavior. We did not find any reason why the application should be slow for clients, but obviously it was.

Then someone mentioned two recent Apache DoS problems. The first one, the cookie hash issue, did not seem to be the cause; we did not see the huge CPU or memory consumption we would expect with such an attack. The second one, the slow-read problem (there is no maximum connection duration timeout in Apache, which can be exploited with a small TCP receive window), looked like it could be an issue. The slow-read DoS problem can be detected by looking at the server-status page.

What you would see on the server-status page is a lot of worker threads in the ‘W’ (write data) state. This is supposed to be an indication of slow reads. We did see this.

As our site is behind a reverse proxy with some kind of IDS/IPS feature, we took the reverse proxy out of the picture to get a better view of who is doing what (we do not have X-Forwarded-For configured).

At this point we still noticed a lot of connections in the ‘W’ state coming from the rev-proxy. This was strange; it was not supposed to do this. After restarting the rev-proxy (while the clients went directly to the webservers) those ‘W’ entries were still in the server-status. This was getting really strange. And to add to this, the duration of the ‘W’ state showed that these entries had been active for several thousand seconds. Ugh. WTF?

OK, next step: killing the offenders. First I verified in the list of connections in the server-status (extended-status is activated) that all worker threads of a given PID with a rev-proxy connection were in this strange state and no client request was active. Then I killed this particular PID. I planned to repeat this until no such strange connections were left. Unfortunately I arrived at PIDs which were listed in the server-status (even after a refresh) but did not exist in the OS anymore. That is bad. Very bad.

So the next step was to move all clients away from one webserver, and then to reboot this webserver completely to be sure the entire system was in a known-good state for future monitoring (the big-hammer approach).

As we did not know whether this strange state was due to some kind of mis-administration of the system or not, we decided to put the rev-proxy in front of the webserver again and to monitor the systems.

We survived about one and a half days. After that, all worker threads on all webservers were in this state. DoS. At this point we were sure there was something malicious going on (some days later our management showed us a mail from a company which had offered security consulting two months earlier, to make sure we do not get hit by a DDoS during the holiday season… a coincidence?).

Next step: verification of missing security patches (unfortunately it is not us who decides which patches are applied to the systems). We noticed that the rev-proxy was missing a patch for a DoS problem, and that for the webservers a new fixpack was scheduled to be released in the near future (as of this writing, it is available).

Since we applied the DoS fix to the rev-proxy, we have not had the problem anymore. This is not really conclusive, as we do not know whether the fix solved the problem or the attacker simply stopped attacking us.

From reading what the DoS patch fixes, we would expect to see some continuous traffic between the rev-proxy and the webserver, but there was nothing of the kind when we observed the strange state.

We are still not allowed to apply patches the way we think we should, but at least we have better monitoring in place to watch out for this particular problem: activate the extended status in Apache/IHS, look for lines with state ‘W’ and a long duration (column ‘SS’), and raise an alert if the duration is higher than the maximum possible/expected/desired duration for all possible URLs. A rough sketch of such a check follows.
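This is roughly what such a check could look like (a sketch, not our production monitoring; the URL, the threshold, and the field positions after stripping the HTML tags are assumptions that need to be adapted to the actual server-status layout):

    #!/bin/sh
    # Flag worker threads that have been in state 'W' longer than expected.
    URL="http://localhost/server-status"
    THRESHOLD=300   # seconds; set to the maximum expected request duration
    curl -s "$URL" | sed 's/<[^>]*>/ /g' | awk -v t="$THRESHOLD" '
        # After tag stripping: $4 = worker state (M), $6 = seconds in that
        # state (SS) -- verify these positions against your status table.
        $4 == "W" && $6+0 > t+0 { print "long W state:", $0 }'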

A Phoronix benchmark creates a huge benchmarking discussion

The recent Phoronix benchmark which compared a release candidate of FreeBSD 9 with Oracle Linux Server 6.1 created a huge discussion on the FreeBSD mailing lists. The reason was that some people think the numbers presented there give a wrong picture of FreeBSD. Partly because not all benchmark numbers are presented on the most prominent page (as linked above), but only at a different place. This gives the impression that FreeBSD is inferior in this benchmark, while it just puts the focus (for a reason, according to some people) on a different part of the benchmark. To be more specific: blogbench does disk reads and writes in parallel, FreeBSD gives higher priority to writes than to reads, FreeBSD 9 outperforms OLS 6.1 in the writes while OLS 6.1 shines with the reads, and only the reads are presented on the first page. Another complaint is that the article claims the default install was used (which would mean UFS as the filesystem), when it was not (ZFS was used).

The author of the Phoronix article participated in parts of the discussion and asked for specific improvement suggestions. A FreeBSD committer seems to be already working to get some issues resolved. What I do not like personally is that the article is not updated with a remark that some things presented do not reflect reality and that a retest is necessary.

As there was much talk in the thread but not much obvious activity from our side to resolve the issues, I started to improve the FreeBSD wiki page about benchmarking, so that we are able to point to it in case someone wants to benchmark FreeBSD. Others have already chimed in and improved some things too. It is far from perfect; some more eyes, and more importantly some more fingers which add content, are needed. Please go to the wiki page and try to help out (if you are afraid to write something in the wiki, please at least post your suggestions to a FreeBSD mailing list so that others can improve the wiki page).

What we also need is a wiki page about FreeBSD tuning (a first step would be to take the tuning man page and convert it into a wiki page, then to improve it, and then to feed the changes back to the man page while keeping the wiki page, to be able to cross-reference parts from the benchmarking page).

I already said this in the thread about the Phoronix benchmark: everyone is welcome to improve the situation. Do not talk, write something. No matter if it is an improvement to the benchmarking page, tuning advice, or a tool which inspects the system and suggests some tuning. If you want to help in the wiki, create a FirstnameLastname account and ask a FreeBSD committer for write access.

A while ago (IIRC we have to think in months or even years) there was a framework for automatic FreeBSD benchmarking. Unfortunately the author ran out of time. The framework was able to install a FreeBSD system on a machine, run a specified benchmark (not many benchmarks were integrated), and then install another FreeBSD version to run the same benchmark, or reinstall the same version to run another benchmark. IIRC there was also a DB behind it which collected the results, and maybe there was even some way to compare them. It would be nice if someone could find the time to talk with the author, get the framework, and set it up somewhere, so that we have a controlled environment where we can do our own benchmarks in an automatic and repeatable fashion with several FreeBSD versions.