Alexander Leidinger

Just another weblog

May
26

New users in Solaris 10 branded zones on Solaris 11 not handled automatically

A colleague noticed that on a Solaris 11 system, a Solaris 10 branded zone “gains” two new daemons which run with UID 16 and 17. The corresponding users are not automatically added to /etc/passwd and /etc/shadow (nor is the group added to /etc/group), at least not when the zone is imported from an existing Solaris 10 zone.

For our few Solaris 10 branded zones on Solaris 11, I added the two users (netadm, netcfg) and the group (netadm) by hand: copy & paste of the corresponding lines in /etc/passwd, /etc/shadow and /etc/group, followed by a run of pwconv.
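
To find out which zones still need this treatment, a small check like the following could help. It is only a sketch: the function name is made up for illustration, and it assumes you feed it the contents of the zone's /etc/passwd. It reports which of the given UIDs have no entry yet.

```python
def missing_uids(passwd_text, required_uids):
    """Return the required UIDs that have no entry in passwd-format text."""
    present = set()
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # passwd format: name:password:uid:gid:gecos:home:shell
        fields = line.split(":")
        if len(fields) >= 3 and fields[2].isdigit():
            present.add(int(fields[2]))
    return sorted(set(required_uids) - present)

# Usage idea (the path is whatever your zone's passwd file is):
#   with open(path_to_zone_passwd) as f:
#       print(missing_uids(f.read(), [16, 17]))
```

An empty list means both UIDs already have an entry. The same idea works for /etc/group if you look at the third field there as the GID.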


May
26

Increase of DNS requests after a critical patch update of Solaris 10

Some weeks ago we installed a critical patch update (CPU) on a Solaris 10 system. It is an internal system with a year's worth of CPUs to install; nothing in them affected us or was considered a security risk, but we decided to apply this one regardless so as not to fall too far behind. Afterwards we noticed that two zones were doing a lot of DNS requests. We had already noticed this before the zones went into production, and back then we configured a positive time to live in nscd.conf for “hosts”. Additionally we noticed a lot of DNS requests for IPv6 addresses (AAAA lookups), although absolutely no IPv6 address is configured in the zones (not even for localhost, and those are exclusive-IP zones). Apparently the caching behaviour changed with one of the patches in the CPU; I am not sure whether we had the AAAA lookups before.

Today I got some time to debug this. After adding caching of “ipnodes” in addition to “hosts” (and configuring a negative time to live for both at the same time), the DNS requests came down to a sane amount.
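
To verify on other zones that both caches really have a positive and a negative time to live, something like the following sketch could be used. It assumes the usual three-column nscd.conf layout (directive, cache name, value); the function names are made up for illustration, and the TTL values themselves are up to you.

```python
def nscd_ttls(conf_text):
    """Map cache name -> {"positive": secs, "negative": secs} from nscd.conf text."""
    ttls = {}
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0]  # drop comments
        parts = line.split()
        if len(parts) != 3:
            continue
        directive, cache, value = parts
        if directive in ("positive-time-to-live", "negative-time-to-live") and value.isdigit():
            kind = directive.split("-", 1)[0]  # "positive" or "negative"
            ttls.setdefault(cache, {})[kind] = int(value)
    return ttls


def caches_missing_ttl(conf_text, caches=("hosts", "ipnodes")):
    """Return the caches that lack a positive or a negative time to live."""
    ttls = nscd_ttls(conf_text)
    return [c for c in caches if set(ttls.get(c, {})) != {"positive", "negative"}]
```

Calling caches_missing_ttl() with the contents of /etc/nscd.conf should return an empty list once “hosts” and “ipnodes” are both fully configured.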

For the AAAA lookups I have not found a solution. From my reading of the documentation, I would assume that there are no IPv6 DNS lookups if no IPv6 address is configured.


Jan
15

Complete network loss on Solaris 10u10 CPU 2012-10 on virtualized T4-2

The problem I see at work: a T4-2 with 3 guest LDOMs, virtualized disks and networks lost all network connectivity “out of the blue” once, and possibly also sporadically directly after a cold boot. After a lot of discussion with Oracle, I have the impression that we have two problems here.

1st problem:
Total network loss of the machine (no zone, no guest LDOM and not even the primary LDOM was able to receive or send IP packets). This happened once, and I have no idea how to reproduce it. In the logs we see the message “[ID 920994 kern.warning] WARNING: vnetX: exceeded number of permitted handshake attempts (5) on channel xxx”. According to Oracle this is supposed to be fixed in patch 148677-01, which will come with Solaris 10u11. They suggested using a vsw interface instead of a vnet interface on the primary domain to at least lower the probability of this problem hitting us. They were not able to tell us how to reproduce the problem (it seems to be a race condition; at least that is my impression from the description given by the Oracle engineer handling the SR). Only a reboot solved the problem. I was told we are the only customer who has reported this kind of problem; the patch is based upon an internal bug report from internal tests.
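
As we cannot reproduce this, the best we can do is to watch for the warning. A minimal sketch of such a check follows; it assumes the message text stays exactly as quoted above, with a numbered vnet interface in place of “vnetX”. The function name is made up for illustration.

```python
import re

# Pattern built from the warning quoted above; "vnetX" and "xxx" are placeholders
# in the post, so the interface number and the channel are captured here.
HANDSHAKE_RE = re.compile(
    r"WARNING: (vnet\d+): exceeded number of permitted handshake attempts "
    r"\((\d+)\) on channel (\S+)"
)


def find_handshake_failures(log_text):
    """Return (interface, attempts, channel) for every matching warning line."""
    hits = []
    for line in log_text.splitlines():
        m = HANDSHAKE_RE.search(line)
        if m:
            hits.append((m.group(1), int(m.group(2)), m.group(3)))
    return hits
```

Fed with the contents of the system log, a non-empty result would be the signal to look at the machine before the network is gone completely.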

2nd problem:
After cold boots, sometimes some machines (not all) are not able to connect to an IP on the T4. A reboot helps, as does removing an interface from an aggregate and directly adding it again (see below for the system config). To try to reproduce the problem, we did a lot of warm reboots of the primary domain, and the problem never showed up. We did some cold reboots, and the problem showed up once.

In case someone else sees one of these problems on their machines too, please get in contact with me, so that we can see what we have in common, track this down further, and share info which may help in reproducing the problems.

System setup:

  • T4-2 with 4 HBAs and 8 NICs (4 * igb on-board, 4 * nxge on additional network card)
  • 3 guest LDOMs and one I/O + control domain (both roles in the primary domain)
  • the guest LDOMs use SAN disks over the 4 HBAs
  • the primary domain uses a mirrored zpool on SSDs
  • 5 vswitches in the hypervisor
  • 4 aggregates (aggr1 to aggr4 with L2 policy), each one with one igb and one nxge NIC
  • each aggregate is connected to a separate vswitch (the 5th vswitch is for machine-internal communication)
  • each guest LDOM has three vnets, each vnet connected to a vswitch (1 guest LDOM has aggr1+2 only for zones (via vnets), 2 guest LDOMs have aggr3+4 only for zones (via vnets), and all LDOMs have aggr2+3 (via vnets) for global-zone communication; all LDOMs are additionally connected to the machine-internal-only vswitch via the 3rd vnet)
  • the primary domain uses 2 vnets connected to the vswitches which are connected to aggr2 and aggr3 (for consistency with the other LDOMs on this machine) and has no zones
  • this means each entity (primary domain, guest LDOMs and each zone) has two vnets, and those two vnets are configured in a link-based IPMP setup (vnet-linkprop=phys-state)
  • each vnet has VLAN tagging configured in the hypervisor (with the zones being in different VLANs than the LDOMs)

The change proposed by Oracle is to replace the 2 vnet interfaces in the primary domain with 2 vsw interfaces (which means doing the VLAN tagging in the primary domain directly instead of in the vnet config). For IPMP to keep working, this requires vsw-linkprop=phys-state. We have two systems with the same setup; on one of them we already made this change, and it is working as before. As we don’t know how to reproduce the 1st problem, we don’t know whether the problem is fixed, or what the probability is of getting hit by it again.

Ideas / suggestions / info welcome.


Aug
10

Reverse engineering a 10 year old Java program

Recently I started to reverse engineer a ~10 year old Java program (that means it was written at about the same time I touched Java for the first and last time at university; not because of a dislike of Java, but because other programming languages were more suitable for the problems at hand). Actually I am just reverse engineering the GUI applet (the frontend) of a service. The vendor has not existed for about 10 years, the program was not taken over by someone else, and the system it is used from needs to be updated. The problem: it runs with JRE 1.3. With Java 5 we do not get error messages, but it does not work as it is supposed to. With Java 6 we get a popup about some values being NULL or 0.

So, first step: decompile all classes of the applet. Second step: compile the result for JRE 1.3 and test whether it still works. Third step: modify it to run with Java 6 or 7. Fourth step: be happy.

Well, after decompiling all classes I now have about 1450 source files (~1100 Java source code files, the rest are pictures, properties files and maybe other stuff). From initially more than 4000 compile errors I am down to about 600. And those are only the compile errors. Bugs in the code (either put there by the decompiler, or by the programmers who wrote this software) are still to be detected. Unfortunately I don’t know if I can just compile a subset of all classes for Java 6/7 and let the rest be compiled for Java 1.3, but I have a test environment where I can play around.
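
With hundreds of errors left, it can help to see which files contribute most of them. The following is a sketch of a helper for that, not something I used for the numbers above. It assumes the usual javac diagnostic layout of “File.java:LINE: message” and skips lines marked as warnings; the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches "path/File.java:123: ..." but not "path/File.java:123: warning: ..."
ERROR_RE = re.compile(r"^(.+\.java):(\d+): (?!warning:)")


def errors_per_file(javac_output):
    """Count compile errors per source file, most affected files first."""
    counts = Counter()
    for line in javac_output.splitlines():
        m = ERROR_RE.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts.most_common()
```

The result is a list of (file, error count) pairs, so the files at the top are the ones where a single fix (for example a wrongly decompiled construct) may remove many errors at once.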

Plan B (searching for a replacement for the application) is already in progress in parallel. We will see which solution is faster.

Jul
06

WebSphere 7: solution to “password is not set” while there is a password set

I googled a lot regarding the error message “password is not set” when testing a datasource in WebSphere (7.0.0.21), but I did not find a solution. A co-worker finally found a solution (by accident?).

Problem case

While the application JVMs were running, I created a new JAAS-J2C authenticator (in my case the same login but a different password) and changed the datasource to use the new authenticator. I saved the config and synchronized it. The files config/cells/cellname/nodes/nodename/resources.xml and config/cells/cellname/security.xml showed that the changes had arrived on the node. Testing the datasource connectivity now fails with:

DSRA8201W: DataSource Configuration: DSRA8040I: Failed to connect to the DataSource.  Encountered java.sql.SQLException: The application server rejected the connection. (Password is not set.)DSRA0010E: SQL State = 08004, Error Code = -99,999.

Restarting the application JVMs does not help.

Solution

After stopping everything (application JVMs, nodeagent and deployment manager) and starting everything again, the connection test of the datasource immediately works as expected.

I have not tested whether it is enough to just stop all application JVMs on one node and the corresponding nodeagent, or whether I really have to stop the deployment manager too.
