Alexander Leidinger

Just another weblog


Complete network loss on Solaris 10u10 CPU 2012-10 on virtualized T4-2

The problem I see at work: a T4-2 with 3 guest LDOMs, virtualized disks and networks, lost complete network connectivity “out of the blue” once, and maybe “sporadically” directly after a cold boot. After a lot of discussion with Oracle, I have the impression that we have two problems here.

1st problem:
Total network loss of the machine (no zone, guest LDOM, or even the primary LDOM was able to receive or send IP packets). This happened once, and we have no idea how to reproduce it. In the logs we see the message “[ID 920994 kern.warning] WARNING: vnetX: exceeded number of permitted handshake attempts (5) on channel xxx”. According to Oracle this is supposed to be fixed in patch 148677-01, which will come with Solaris 10u11. They suggested using a vsw interface instead of a vnet interface on the primary domain to at least lower the probability of this problem hitting us. They were not able to tell us how to reproduce the problem (it seems to be a race condition; at least that is the impression I get from the description of the Oracle engineer handling the SR). Only a reboot solved the problem. I was told we are the only client which has reported this kind of problem; the patch is based upon a bug report from Oracle-internal tests.

2nd problem:
After cold boots, sometimes some machines (not all) are not able to connect to an IP on the T4. A reboot helps, as does removing an interface from an aggregate and directly adding it again (see below for the system config). To try to reproduce the problem, we did a lot of warm reboots of the primary domain, and the problem never showed up. We did some cold reboots, and the problem showed up once.
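Sketched in Solaris 10 dladm syntax, the remove/add workaround looks roughly like this (the aggregate key and the NIC name are assumptions based on the setup described below, not our literal commands):

```shell
# take one NIC out of the aggregate with key 1 (aggr1) and put it straight back
dladm remove-aggr -d nxge0 1
dladm add-aggr -d nxge0 1
```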

In case someone else sees one of these problems on their machines too, please get in contact with me so we can see what we have in common, try to track this down further, and share information which may help in reproducing the problems.

System setup:

  • T4-2 with 4 HBAs and 8 NICs (4 * igb on-board, 4 * nxge on an additional network card)
  • 3 guest LDOMs and one io+control domain (both in the primary domain)
  • the guest LDOMs use SAN disks over the 4 HBAs
  • the primary domain uses a mirrored zpool on SSDs
  • 5 vswitches in the hypervisor
  • 4 aggregates (aggr1 to aggr4 with L2 policy), each one with one igb and one nxge NIC
  • each aggregate is connected to a separate vswitch (the 5th vswitch is for machine-internal communication)
  • each guest LDOM has three vnets, each vnet connected to a vswitch (1 guest LDOM has aggr1+2 only for zones (via vnets), 2 guest LDOMs have aggr3+4 only for zones (via vnets), all LDOMs have aggr2+3 (via vnets) for global-zone communication, and all LDOMs are additionally connected to the machine-internal-only vswitch via the 3rd vnet)
  • the primary domain uses 2 vnets, connected to the vswitches which are connected to aggr2 and aggr3 (for consistency with the other LDOMs on this machine), and has no zones
  • this means each entity (primary domain, guest LDOMs and each zone) has two vnets, and those two vnets are configured in a link-based IPMP setup (vnet-linkprop=phys-state)
  • each vnet has VLAN tagging configured in the hypervisor (with the zones being in different VLANs than the LDOMs)
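For illustration, one of the link-based IPMP pairs from the setup above could look like this on Solaris 10 (interface names and the address are placeholders, not our real config):

```shell
# /etc/hostname.vnet0 -- first interface, carries the data address
192.0.2.10 netmask + broadcast + group ipmp0 up
# /etc/hostname.vnet1 -- second interface in the same IPMP group
group ipmp0 up
```

Link-based failure detection (no test addresses) is what makes the linkprop=phys-state setting on the vnets necessary: without it, IPMP never sees the physical link go down.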

The change proposed by Oracle is to replace the 2 vnet interfaces in the primary domain with 2 vsw interfaces (which means doing the VLAN tagging in the primary domain directly instead of in the vnet config). To have IPMP working, this requires vsw-linkprop=phys-state. We have two systems with the same setup; on one system we already made this change and it is working as before. As we don't know how to reproduce the 1st problem, we don't know if the problem is fixed or not, or what the probability is of being hit by it again.
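Sketched with the ldm command, the proposed change would look roughly like this (all vnet/vswitch names and VLAN IDs are placeholders; verify against the Oracle VM Server for SPARC documentation before copying anything):

```shell
# drop the two vnet interfaces from the primary domain
ldm remove-vnet admin-vnet0 primary
ldm remove-vnet admin-vnet1 primary
# let IPMP in the primary domain see the physical link state of the vsws
ldm set-vsw linkprop=phys-state primary-vsw1
ldm set-vsw linkprop=phys-state primary-vsw2
# plumb the vsw interfaces directly; VLAN tagging is then done in the
# primary domain, e.g. via the Solaris 10 VLAN PPA convention
# (VLAN-ID * 1000 + instance), here VLAN 123 on vsw1:
ifconfig vsw123001 plumb group ipmp0 up
```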

Ideas / suggestions / info welcome.



Incompatible WP plugins

The geosmart plugin is incompatible with the one-time-password (OTP) plugin of WordPress. The problem is that the OTP plugin no longer displays the challenge on the login page when the geosmart plugin is activated.

A workaround may be to make sure the geosmart plugin does not do anything on the login page, but this incompatibility could also cause problems somewhere else.

The problem could be related to the way the geosmart plugin uses jQuery. I found a bug report for OTP where the problem was the jQuery handling in another plugin. The specific problem mentioned there does not seem to be the same as in the geosmart plugin, at least from the very quick look I had.

So… for now I have disabled the geosmart plugin. Before that, most of the time I guessed the sequence number right, but sometimes I did not.


Email app from Android 3.1 in Android 3.2?

As previously reported, I tried the update to Android 3.2 on my Tab and was not happy about the new Email app. At the weekend I had a little bit of time, so I tried to get the Email.apk from Android 3.1 into Android 3.2.

Long story short, I failed.

TitaniumBackup PRO was restoring for hours (the option to migrate from a different ROM version was enabled) until I killed the app, and it did not get anywhere (I just emailed their support to ask whether I did something completely stupid, or whether this is a bug in TB). And copying it by hand into /system/apps did not work (the app fails to start).

Ideas welcome.


Strange performance problem with the IBM HTTP Server (modified Apache)

Recently we had a strange performance problem at work. A web application was having slow response times from time to time and users complained. We did not see any uncommon CPU/mem/swap usage on any involved machine. I generated heat-maps from performance measurements and there were no obvious traces of slow behavior. We did not find any reason why the application should be slow for clients, but obviously it was.

Then someone mentioned two recent Apache DoS problems. Number one, the cookie hash issue, did not seem to be the cause: we did not see the huge CPU or memory consumption which we would expect with such an attack. The second one, the slow-read problem (no maximum connection duration timeout in Apache; it can be exploited with a small TCP receive window), looked like it could be an issue. The slow-read DoS problem can be detected by looking at the server-status page.

What you would see on the server-status page is a lot of worker threads in the ‘W’ (sending reply) state. This is supposed to be an indication of slow reads. We did see this.

As our site is behind a reverse proxy with some kind of IDS/IPS feature, we took the reverse proxy out of the picture to get a better view of who is doing what (we do not have X-Forwarded-For configured).

At this point we still noticed a lot of connections in the ‘W’ state from the rev-proxy. This was strange; it was not supposed to do this. After restarting the rev-proxy (while the clients went directly to the webservers) we still had those ‘W’ entries in the server-status. This was getting really strange. And to add to this, the duration of the ‘W’ state showed that those rev-proxy connections had been in this state for several thousand seconds. Ugh. WTF?

Ok, next step: killing the offenders. First I verified in the connection list of the server-status page (extended-status is activated) that all worker threads of a given PID with rev-proxy connections were in this strange state and no client request was active. Then I killed that particular PID. I wanted to repeat this until no such strange connections were left. Unfortunately I arrived at PIDs which were listed in the server-status (even after a refresh) but did not exist in the OS. That is bad. Very bad.
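Cross-checking a PID from the status page against the OS before killing it catches exactly this inconsistency. A minimal sketch (note that kill -0 sends no signal, it only tests for existence, and it will also report “gone” if you lack permission to signal the process):

```shell
# check whether a PID listed on the server-status page still exists in the OS
pid_state() {
  if kill -0 "$1" 2>/dev/null; then
    echo "alive"   # process exists (and we may signal it)
  else
    echo "gone"    # no such process (or no permission to signal it)
  fi
}
```

A PID that the server-status page still lists but for which this reports “gone” is the inconsistent state described above.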

So the next step was to move all clients away from one webserver, and then to reboot this webserver completely, to be sure the entire system was in a known-good state for future monitoring (the big-hammer approach).

As we did not know whether this strange state was due to some kind of mis-administration of the system or not, we decided to put the rev-proxy in front of the webservers again and to monitor the systems.

We survived about one and a half days. After that, all worker threads on all webservers were in this state. DoS. At this point we were sure something malicious was going on (some days later our management showed us a mail from a company which had offered security consulting two months before, to make sure we do not get hit by a DDoS during the holiday season… a coincidence?).

Next step: verification of missing security patches (unfortunately it is not us who decide which patches get applied to the systems). We noticed that the rev-proxy was missing a patch for a DoS problem, and that a new fixpack for the webservers was scheduled for release in the near future (as of this writing it is available).

Since we applied the DoS fix for the rev-proxy, we have not had the problem anymore. This is not really conclusive, as we do not know whether the fix solved the problem or the attacker simply stopped attacking us.

From reading what the DoS patch fixes, we would expect to see some continuous traffic between the rev-proxy and the webserver, but there was nothing when we observed the strange state.

We are still not allowed to apply patches as we think we should, but at least we have better monitoring in place to watch out for this particular problem: activate the extended status in Apache/IHS, look for lines with state ‘W’ and a long duration (column ‘SS’), and raise an alert if the duration is higher than the maximum possible/expected/desired duration for all possible URLs.
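This check can be scripted against a saved copy of the extended server-status page. The sketch below is an assumption-laden illustration: it takes for granted that the mode (M) is the 4th and the SS value the 6th cell of each table row, which you should verify against the status page of your Apache/IHS version before relying on it.

```shell
# count workers that have been in state 'W' longer than a threshold,
# based on the HTML table rows of the extended server-status page
stuck_w_workers() {
  # $1: file containing the server-status HTML, $2: SS threshold in seconds
  awk -F'</td><td>' -v max="$2" '
    NF >= 6 && $4 == "W" && ($6 + 0) > max { n++ }
    END { print n + 0 }
  ' "$1"
}
```

Fed from cron with a fetch of the status page (e.g. wget -O - of /server-status), a non-zero count would trigger the alert described above.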


A Phoronix benchmark creates a huge benchmarking discussion

The recent Phoronix benchmark which compared a release candidate of FreeBSD 9 with Oracle Linux Server 6.1 created a huge discussion on the FreeBSD mailing lists. The reason was that some people think the numbers presented there give a wrong picture of FreeBSD, partly because not all benchmark numbers are presented on the most prominent page (as linked above), but only at a different place. This gives the impression that FreeBSD is inferior in this benchmark, while it just puts the focus (for a reason, according to some people) on a different part of the benchmark. To be more specific: blogbench does disk reads and writes in parallel, and FreeBSD gives higher priority to writes than to reads; FreeBSD 9 outperforms OLS 6.1 in the writes while OLS 6.1 shines in the reads, and only the reads are presented on the first page. Another complaint is that the article claims the default install was used (which would mean UFS as the filesystem), when it was not (ZFS was used).

The author of the Phoronix article participated in parts of the discussion and asked for specific improvement suggestions. A FreeBSD committer already seems to be working on getting some issues resolved. What I personally do not like is that the article has not been updated with a remark that some of the things presented do not reflect reality and that a retest is necessary.

As there was much talk in the thread but not much visible activity from our side to resolve the issues, I started to improve the FreeBSD wiki page about benchmarking, so that we are able to point to it in case someone wants to benchmark FreeBSD. Others already chimed in and improved some things too. It is far from perfect; some more eyes, and more importantly some more fingers which add content, are needed. Please go to the wiki page and try to help out (if you are afraid to write something in the wiki, at least post your suggestions to a FreeBSD mailing list so that others can improve the wiki page).

What we also need is a wiki page about FreeBSD tuning (a first step would be to take the tuning(7) man-page and convert it into a wiki page, then to improve it, and then to feed the changes back into the man-page while keeping the wiki page, so that parts of it can be cross-referenced from the benchmarking page).

I already said this in the thread about the Phoronix benchmark: everyone is welcome to improve the situation. Do not just talk, write something. No matter whether it is an improvement to the benchmarking page, tuning advice, or a tool which inspects the system and suggests some tuning. If you want to help in the wiki, create a FirstnameLastname account and ask a FreeBSD committer for write access.

A while ago (IIRC we have to think in months or even years) there was a framework for automatic FreeBSD benchmarking. Unfortunately the author ran out of time. The framework was able to install a FreeBSD system on a machine, run a specified benchmark (not many benchmarks were integrated), and then install another FreeBSD version to run the same benchmark, or reinstall the same version to run another benchmark. IIRC there was also a DB behind it which collected the results, and maybe there was even some way to compare them. It would be nice if someone could find the time to talk to the author, get the framework, and set it up somewhere, so that we have a controlled environment where we can do our own benchmarks in an automatic and repeatable fashion across several FreeBSD versions.
