A colleague noticed that on a Solaris 11 system a Solaris 10 branded zone “gains” two new daemons which are running with UID 16 and 17. Those users are not automatically added to /etc/passwd, /etc/shadow (and /etc/group)… at least not when the zones are imported from an existing Solaris 10 zone.
I added the two users (netadm, netcfg) and the group (netadm) to the Solaris 10 branded zones by hand (copy&paste of the lines in /etc/passwd, /etc/shadow, /etc/group + run pwconv) for our few Solaris 10 branded zones on Solaris 11.
Some weeks ago we installed critical patch updates (CPU) on a Solaris 10 system (internal system, a year of CPU to install, nothing in it affecting us or was considered a security risk, we decided to apply this one regardless to not fall behind too much). Afterwards we noticed that two zones are doing a lot of DNS requests. We noticed this already before the zones went into production and we configured a positive time to live in nscd.conf for “hosts”. Additionally we noticed a lot of DNS requests for IPv6 addresses (AAAA lookups), while absolutely no IPv6 address is configured in the zones (not even for localhost… and those are exclusive IP zones). Apparently with one of the patches in the CPU the behaviour changed regarding the caching, I am not sure if we had the AAAA lookups before.
Today I got some time to debug this. After adding caching of “ipnodes” in addition to “hosts” (and I configured a negative time to live for both at the same time), the DNS requests came down to a sane amount.
For the AAAA lookups I have not found a solution. By my reading of the documentation I would assume there are not IPv6 DNS lookups if there is not IPv6 address configured.
The problem I see at work: A T4-2 with 3 guest LDOMs, virtualized disks and networks lost the complete network connectivity “out of the blue” once, and maybe “sporadic” directly after a cold boot. After a lot of discussion with Oracle, I have the impression that we have two problems here.
Total network loss of the machine (no zone or guest LDOM or the primary LDOM was able to have receive or send IP packets). This happened once. No idea how to reproduce it. In the logs we see the message “[ID 920994 kern.warning] WARNING: vnetX: exceeded number of permitted handshake attempts (5) on channel xxx”. According to Oracle this is supposed to be fixed in 148677 – 01 which will come with Solaris 10u11. They suggested to use a vsw interface instead of a vnet interface on the primary domain to at least lower the probability of this problem hitting us. They were not able to tell us how to reproduce the problem (seems to be a race condition, at least I get this impression based upon the description of the Oracle engineer handling the SR). Only a reboot helped to get the problem solved. I was told we are the only client which reported this kind of problem, the patch for this problem is based upon an internal bugreport from internal tests.
After cold boots sometimes some machines (not all) are not able to connect to an IP on the T4. A reboot helps, as does removing an interface from an aggregate and directly adding it again (see below for the system config). To try to reproduce the problem, we did a lot of warm reboots of the primary domain, and the problem never showed up. We did some cold reboots, and the problem showed up once.
In case someone else sees one of those problems on his machines too, please get in contact with me to see what we have in common to try to track this down further and to share info which may help in maybe reproducing the problems.
- T4-2 with 4 HBAs and 8 NICs (4 * igb on-board, 4 * nxge on additional network card)
- 3 guest LDOMs and one io+control domain (both in the primary domain)
- the guest LDOMs use SAN disks over the 4 HBAs
- the primary domain uses a mirrored zpool on SSDs
- 5 vswitch in the hypervisor
- 4 aggregates (aggr1 – aggr4 with L2-policy), each one with one igb and one nxge NIC
- each aggregate is connected to a separate vswitch (the 5th vswitch is for machine-internal communication)
- each guest LDOM has three vnets, each vnets connected to a vswitch (1 guest LDOM has aggr1+2 only for zones (via vnets), 2 guest LDOMs have aggr 3+4 only for zones (via vnets), and all LDOMs have aggr2+3 (via vnets) for global-zone communication, all LDOMs are additionally connected to the machine-internal-only vswitch via the 3rd vnet)
- primary domain uses 2 vnets connected to the vswitch which is connected to aggr2 and aggr3 (consistency with the other LDOMs on this machine) and has no zones
- this means each entity (primary domain, guest LDOMs and each zone) has two vnets in and those two vnets are configured in a link-based IPMP setup (vnet-linkprop=phys-state)
- each vnet has VLAN tagging configured in the hypervisor (with the zones being in different VLANs than the LDOMs)
The proposed change by Oracle is to replace the 2 vnet interfaces in the primary domain with 2 vsw interfaces (which means to do VLAN tagging in the primary domain directly instead of in the vnet config). To have IPMP working this means to have vsw-linkprop=phys-state. We have two systems with the same setup, on one system we already changed this and it is working as before. As we don’t know how to reproduce the 1st problem, we don’t know if the problem is fixed or not, respectively what the probability is to get hit again by this problem.
Ideas / suggestions / info welcome.
I googled a lot regarding the error message “password is not set” when testing a datasource in WebSphere (220.127.116.11), but I did not find a solution. A co-worker finally found a solution (by accident?).
While having the application JVMs running, I created a new JAAS-J2C authenticator (in my case the same login but a different password), and changed the datasource to use the new authenticator. I saved the config and synchronized it. The files config/cells/cellname/nodes/nodename/resources.xml and config/cells/cellname/security.xml showed that the changes arrived on the node. Testing the datasource connectivity fails now with:
DSRA8201W: DataSource Configuration: DSRA8040I: Failed to connect to the DataSource. Encountered java.sql.SQLException: The application server rejected the connection. (Password is not set.)DSRA0010E: SQL State = 08004, Error Code = –99,999.
Restarting the application JVMs does not help.
After stopping everything (application JVMs, nodeagent and deployment manager) and starting everything again, the connection test of the datasource works directly as expected.
I have not tested if it is enough to just stop all application JVMs on one node and the correspding nodeagent, or if I really have to stop the deployment manager too.