Solaris 10/11(.3) boot panic/crash after moving rpool to a new storage system

Situation

The boot disks of some Solaris LDOMs were migrated from one storage system to another by mirroring the rpool via ZFS to the new system and then detaching the old LUN.

Issue

After rebooting on the new storage system, Solaris 10 and 11(.3) panic at boot.

Cause

  • rpool not on slice 0 but on slice 2
  • a bug in Solaris when doing such a mirror and “just” rebooting <- this is the real issue; it seems Solaris cannot handle a change of the name of the underlying device of an rpool, as merely moving the partitioning to slice 0 does not fix the panic

Fix

# boot from the network (or from an alternate pool which was not yet moved), import/export the pool, then boot from the pool
boot net -
# go to the shell
# if needed: change the partitioning so that slice 0 has the same values as slice 2 (i.e. make sure the rpool is in slice 0)
zpool import -R /tmp/yyy rpool
zpool export rpool
reboot
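For the “if needed” partitioning step, a minimal sketch of how it can be done from that shell; the device name c0t0d0 is a placeholder, and START/SIZE have to be taken from the prtvtoc output for slice 2:

# print the current VTOC and note the start sector and size of slice 2
prtvtoc /dev/rdsk/c0t0d0s2
# give slice 0 the same start and size as slice 2 (tag 2 = root, flag 00)
fmthard -d 0:2:00:START:SIZE /dev/rdsk/c0t0d0s2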


New users in Solaris 10 branded zones on Solaris 11 not handled automatically

A colleague noticed that on a Solaris 11 system a Solaris 10 branded zone “gains” two new daemons which run with UID 16 and 17. Those users are not automatically added to /etc/passwd, /etc/shadow (and /etc/group)… at least not when the zone was imported from an existing Solaris 10 zone.

I added the two users (netadm, netcfg) and the group (netadm) to the Solaris 10 branded zones by hand (copy&paste of the lines in /etc/passwd, /etc/shadow and /etc/group, plus a run of pwconv) for our few Solaris 10 branded zones on Solaris 11.
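For reference, a sketch of the entries in question; the IDs shown (UID 16/17, group netadm with GID 65) are the ones a Solaris 11 global zone typically uses, so better copy the exact lines from your own global zone instead of typing them:

# lines to append to the zone's /etc/passwd
netadm:x:16:65:Network Admin:/:
netcfg:x:17:65:Network Configuration Admin:/:
# line to append to the zone's /etc/group
netadm::65:
# matching locked entries for /etc/shadow, e.g. netadm:NP::::::: and
# netcfg:NP:::::::, then rebuild the password database
pwconv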

Increase of DNS requests after a critical patch update of Solaris 10

Some weeks ago we installed critical patch updates (CPU) on a Solaris 10 system (an internal system, with a year's worth of CPUs to install; nothing in them affected us or was considered a security risk, but we decided to apply this one regardless so as not to fall too far behind). Afterwards we noticed that two zones were doing a lot of DNS requests. We had already noticed this before the zones went into production, and had configured a positive time to live for “hosts” in nscd.conf. Additionally we noticed a lot of DNS requests for IPv6 addresses (AAAA lookups), while absolutely no IPv6 address is configured in the zones (not even for localhost… and those are exclusive-IP zones). Apparently the caching behaviour changed with one of the patches in the CPU; I am not sure if we had the AAAA lookups before.

Today I got some time to debug this. After adding caching of “ipnodes” in addition to “hosts” (and configuring a negative time to live for both at the same time), the DNS requests came down to a sane amount.
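For reference, a sketch of the nscd.conf entries involved (the TTL values are examples, not necessarily the ones we use); nscd has to be restarted afterwards:

# /etc/nscd.conf – cache “ipnodes” in addition to “hosts”
positive-time-to-live   hosts   3600
negative-time-to-live   hosts   300
positive-time-to-live   ipnodes 3600
negative-time-to-live   ipnodes 300
# restart the name service cache daemon
svcadm restart svc:/system/name-service-cache:default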

For the AAAA lookups I have not found a solution yet. From my reading of the documentation I would assume there are no IPv6 DNS lookups if no IPv6 address is configured.

Complete network loss on Solaris 10u10 CPU 2012-10 on virtualized T4-2

The problem I see at work: a T4-2 with 3 guest LDOMs, virtualized disks and networks, lost complete network connectivity “out of the blue” once, and perhaps sporadically directly after a cold boot. After a lot of discussion with Oracle, I have the impression that we have two problems here.

1st problem:
Total network loss of the machine (no zone, no guest LDOM and not even the primary domain was able to receive or send IP packets). This happened once. No idea how to reproduce it. In the logs we see the message “[ID 920994 kern.warning] WARNING: vnetX: exceeded number of permitted handshake attempts (5) on channel xxx”. According to Oracle this is supposed to be fixed in 148677-01, which will come with Solaris 10u11. They suggested using a vsw interface instead of a vnet interface on the primary domain to at least lower the probability of this problem hitting us. They were not able to tell us how to reproduce the problem (it seems to be a race condition, at least that is my impression based upon the description of the Oracle engineer handling the SR). Only a reboot got the problem solved. I was told we are the only client which reported this kind of problem; the patch for it is based upon an internal bug report from internal tests.

2nd problem:
After cold boots, sometimes some machines (not all) are not able to connect to an IP on the T4. A reboot helps, as does removing an interface from an aggregate and directly adding it again (see below for the system config). To try to reproduce the problem we did a lot of warm reboots of the primary domain, and the problem never showed up. We did some cold reboots, and the problem showed up once.
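A sketch of the aggregate workaround, in Solaris 10 dladm syntax; the device name igb0 and the aggregation key 1 are placeholders for the real ones:

# remove one port from the aggregate and add it right back
dladm remove-aggr -d igb0 1
dladm add-aggr -d igb0 1
# verify with: dladm show-aggr 1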

In case someone else sees one of these problems on their machines too, please get in contact with me so we can compare what we have in common, track this down further, and share info which may help in reproducing the problems.

System setup:

  • T4-2 with 4 HBAs and 8 NICs (4 * igb on-board, 4 * nxge on an additional network card)
  • 3 guest LDOMs and one io+control domain (io and control both in the primary domain)
  • the guest LDOMs use SAN disks over the 4 HBAs
  • the primary domain uses a mirrored zpool on SSDs
  • 5 vswitches in the hypervisor
  • 4 aggregates (aggr1 – aggr4 with L2 policy), each one with one igb and one nxge NIC
  • each aggregate is connected to a separate vswitch (the 5th vswitch is for machine-internal communication)
  • each guest LDOM has three vnets, each vnet connected to a vswitch (1 guest LDOM has aggr1+2 only for zones (via vnets), 2 guest LDOMs have aggr3+4 only for zones (via vnets), all LDOMs have aggr2+3 (via vnets) for global-zone communication, and all LDOMs are additionally connected to the machine-internal-only vswitch via the 3rd vnet)
  • the primary domain uses 2 vnets connected to the vswitches which are connected to aggr2 and aggr3 (for consistency with the other LDOMs on this machine) and has no zones
  • this means each entity (primary domain, guest LDOMs and each zone) has two vnets, and those two vnets are configured in a link-based IPMP setup (vnet-linkprop=phys-state); a sketch follows after this list
  • each vnet has VLAN tagging configured in the hypervisor (with the zones being in different VLANs than the LDOMs)
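To illustrate the IPMP point of the list, a sketch for one guest LDOM; the LDOM name, the vnet names and the address are placeholders:

# in the control domain: let the guest's vnets report the physical link state
ldm set-vnet linkprop=phys-state vnet0 ldom1
ldm set-vnet linkprop=phys-state vnet1 ldom1
# inside the guest: Solaris 10 link-based IPMP over the two vnets
#   /etc/hostname.vnet0:  192.0.2.11 netmask + broadcast + group ipmp0 up
#   /etc/hostname.vnet1:  group ipmp0 up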

The change proposed by Oracle is to replace the 2 vnet interfaces in the primary domain with 2 vsw interfaces (which means doing the VLAN tagging directly in the primary domain instead of in the vnet config). To have IPMP working this means setting vsw-linkprop=phys-state. We have two systems with the same setup; on one system we have already made this change and it works as before. As we do not know how to reproduce the 1st problem, we do not know whether the problem is fixed or not, or what the probability is of being hit by it again.
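A rough sketch of what that change looks like; the vswitch/vnet names and the VLAN ID 123 are placeholders, not our real configuration:

# in the control domain: let the vsw device report the physical link state
ldm set-vsw linkprop=phys-state primary-vsw2
# drop the vnet the primary domain used on that vswitch so far
ldm remove-vnet admin-vnet2 primary
# VLAN tagging now happens in the primary domain itself: VLAN 123 on vsw
# instance 2 is plumbed as vsw123002 (VLAN PPA = VLAN-ID * 1000 + instance)
ifconfig vsw123002 plumb group ipmp0 up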

Ideas / suggestions / info welcome.

Reverse engineering a 10 year old Java program

Recently I started to reverse engineer a ~10 year old Java program (which means it was written at about the time I touched Java for the first and last time, at university – not because of a dislike of Java, but because other programming languages were more suitable for the problems at hand). Actually I am just reverse engineering the GUI applet (the frontend) of a service. The vendor has not existed for about 10 years, the program was not taken over by anyone else, and the system it is used from needs to be updated. The problem: it runs with JRE 1.3. With Java 5 we do not get error messages, but it does not work as it is supposed to. With Java 6 we get a popup about some values being NULL or 0.

So, first step: decompile all classes of the applet. Second step: compile the result for JRE 1.3 and test if it still works. Third step: modify it to run with Java 6 or 7. Fourth step: be happy.
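A sketch of what steps two and three look like on the command line, assuming the decompiled sources ended up under src/ and an old JRE 1.3 is still around to compile against (all paths are placeholders):

# step 2: compile for JRE 1.3 (1.3 class file format and bootclasspath)
mkdir -p build13 build16
javac -source 1.3 -target 1.3 -bootclasspath /opt/jre1.3/lib/rt.jar \
      -d build13 $(find src -name '*.java')
# step 3 (later): recompile the fixed sources for the Java 6 runtime
javac -source 1.3 -target 1.6 -d build16 $(find src -name '*.java')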

Well, after decompiling all classes I now have about 1450 source files (~1100 Java source code files, the rest are pictures, properties files and maybe other stuff). From initially more than 4000 compile errors I am down to about 600. Well, those are only the compile errors. Bugs in the code (either put there by the decompiler, or by the programmers who wrote this software) are still to be detected. Unfortunately I don't know if I can just compile a subset of all classes for Java 6/7 and let the rest be compiled for Java 1.3, but I have a test environment where I can play around.

Plan B (searching for a replacement of the application) is already in progress in parallel. We will see which solution is faster.

WebSphere 7: solution to “password is not set” while there is a password set

I googled a lot regarding the error message “password is not set” when testing a datasource in WebSphere (7.0.0.21), but I did not find a solution. A co-worker finally found one (by accident?).

Problem case

While the application JVMs were running, I created a new JAAS-J2C authenticator (in my case the same login but a different password) and changed the datasource to use the new authenticator. I saved the config and synchronized it. The files config/cells/cellname/nodes/nodename/resources.xml and config/cells/cellname/security.xml showed that the changes arrived on the node. Testing the datasource connectivity now fails with:

DSRA8201W: DataSource Configuration: DSRA8040I: Failed to connect to the DataSource.  Encountered java.sql.SQLException: The application server rejected the connection. (Password is not set.)DSRA0010E: SQL State = 08004, Error Code = -99,999.

Restarting the application JVMs does not help.

Solution

After stopping everything (application JVMs, nodeagent and deployment manager) and starting everything again, the connection test of the datasource works directly as expected.
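A sketch of the full stop/start sequence, assuming a standard Network Deployment layout; the install path, profile names and server name are placeholders:

# stop everything: application JVMs, nodeagent, deployment manager
/opt/IBM/WebSphere/AppServer/profiles/AppSrv01/bin/stopServer.sh server1
/opt/IBM/WebSphere/AppServer/profiles/AppSrv01/bin/stopNode.sh
/opt/IBM/WebSphere/AppServer/profiles/Dmgr01/bin/stopManager.sh
# start everything again, then re-run the datasource connection test
/opt/IBM/WebSphere/AppServer/profiles/Dmgr01/bin/startManager.sh
/opt/IBM/WebSphere/AppServer/profiles/AppSrv01/bin/startNode.sh
/opt/IBM/WebSphere/AppServer/profiles/AppSrv01/bin/startServer.sh server1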

I have not tested whether it is enough to just stop all application JVMs on one node and the corresponding nodeagent, or whether I really have to stop the deployment manager too.

Strange performance problem with the IBM HTTP Server (modified Apache)

Recently we had a strange performance problem at work. A web application was having slow response times from time to time and users complained. We did not see uncommon CPU/memory/swap usage on any involved machine. I generated heat maps from performance measurements and there were no obvious traces of slow behaviour. We did not find any reason why the application should be slow for clients, but obviously it was.

Then someone mentioned two recent Apache DoS problems. Number one – the cookie hash issue – did not seem to be the cause; we did not see the huge CPU or memory consumption we would expect with such an attack. The second one – the slow-read problem (no maximum connection duration timeout in Apache, exploitable with a small TCP receive window) – looked like it could be an issue. The slow-read DoS problem can be detected by looking at the server-status page.

What you would see on the server-status page is a lot of worker threads in the ‘W’ (write data) state. This is supposed to be an indication of slow reads. We did see this.

As our site is behind a reverse proxy with some kind of IDS/IPS feature, we took the reverse proxy out of the picture to get a better view of who is doing what (we do not have X-Forwarded-For configured).

At this point we still noticed a lot of connections in the ‘W’ state from the rev-proxy. This was strange, it was not supposed to do this. After restarting the rev-proxy (while the clients went directly to the webservers) we still had those ‘W’ entries in the server-status. This was getting really strange. And to add to this, the duration of the ‘W’ state from the rev-proxy showed that this state had been active for several thousand seconds. Ugh. WTF?

Ok, next step: killing the offenders. First I verified in the list of connections in the server-status (extended-status is activated) that all worker threads with a rev-proxy connection of a given PID were in this strange state and no client request was active. Then I killed this particular PID. I wanted to repeat this until none of those strange connections were left. Unfortunately I arrived at PIDs which were listed in the server-status (even after a refresh) but were not available in the OS. That is bad. Very bad.

So the next step was to move all clients away from one webserver, and then to reboot this webserver completely to be sure the entire system was in a known good state for future monitoring (the big-hammer approach).

As we did not know whether this strange state was due to some kind of mis-administration of the system or not, we decided to put the rev-proxy in front of the webserver again and to monitor the systems.

We survived about one and a half days. After that, all worker threads on all webservers were in this state. DoS. At this point we were sure there was something malicious going on (some days later our management showed us a mail from a company which had offered security consulting 2 months before, to make sure we do not get hit by a DDoS during the holiday season… a coincidence?).

Next step: verification of missing security patches (unfortunately it is not us who decides which patches we apply to the systems). What we noticed is that the rev-proxy was missing a patch for a DoS problem, and for the webservers a new fixpack was scheduled to be released in the near future (as of this writing it is available now).

Since we applied the DoS fix for the rev-proxy, we have not had the problem anymore. This is not really conclusive, as we do not really know whether this fixed the problem or whether the attacker simply stopped attacking us.

From reading what the DoS patch fixes, we would assume we should see some continuous traffic going on between the rev-proxy and the webserver, but there was nothing when we observed the strange state.

We are still not allowed to apply patches as we think we should, but at least we have better monitoring in place to watch out for this particular problem (activate the extended status in Apache/IHS, look for lines with state ‘W’ and a long duration (column ‘SS’), and raise an alert if the duration is higher than the maximum possible/expected/desired duration for all possible URLs).
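A rough sketch of such a check; the URL, the 600-second threshold and the column positions ($4 = M, $6 = SS after stripping the HTML tags) are assumptions which have to be verified against the local server-status layout:

# needs “ExtendedStatus On” in the IHS/Apache configuration
curl -s 'http://webserver/server-status' | sed 's/<[^>]*>/ /g' |
  awk '$4 == "W" && $6 > 600 { bad++ } END { exit(bad > 0) }' ||
    echo "ALERT: worker threads stuck in state W for more than 600s"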