The problem I see at work: a T4-2 with 3 guest LDOMs and virtualized disks and networks lost complete network connectivity "out of the blue" once, and sporadically shows connectivity problems directly after a cold boot. After a lot of discussion with Oracle, I have the impression that we are dealing with two separate problems here.
Total network loss of the machine (neither a zone, nor a guest LDOM, nor the primary LDOM was able to receive or send IP packets). This happened once, and we have no idea how to reproduce it. In the logs we see the message "[ID 920994 kern.warning] WARNING: vnetX: exceeded number of permitted handshake attempts (5) on channel xxx". According to Oracle this is supposed to be fixed in patch 148677-01, which will come with Solaris 10u11. They suggested using a vsw interface instead of a vnet interface in the primary domain to at least lower the probability of this problem hitting us. They were not able to tell us how to reproduce the problem (it seems to be a race condition, at least that is the impression I get from the description given by the Oracle engineer handling the SR). Only a reboot got the machine working again. I was told we are the only client which has reported this kind of problem; the patch for it is based upon an internal bug report from Oracle-internal tests.
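Until the patch is available, the only early indicator we know of is the log message itself. A minimal check along these lines (a sketch; the log path is the Solaris 10 default) could at least raise an alarm before users notice:

  # look for the vnet handshake warning quoted above in the system log
  grep "exceeded number of permitted handshake attempts" /var/adm/messages \
    && echo "ALERT: vnet handshake failures logged, network loss may follow"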
After cold boots, some machines (not all) are sometimes not able to connect to an IP on the T4. A reboot helps, as does removing an interface from an aggregate and directly adding it again (see below for the system config). To try to reproduce the problem we did a lot of warm reboots of the primary domain, and the problem never showed up. We did some cold boots, and the problem showed up once.
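For reference, the aggregate workaround looks roughly like this (a sketch with hypothetical device names and aggregation key, using the key-based dladm syntax of Solaris 10):

  # drop one member NIC out of the aggregate with key 1 and re-add it immediately
  dladm remove-aggr -d nxge0 1
  dladm add-aggr -d nxge0 1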
In case someone else sees one of these problems on their machines too, please get in contact with me, so that we can compare setups, see what we have in common, and share info which may help in tracking down and maybe reproducing the problems.
- T4-2 with 4 HBAs and 8 NICs (4 * igb on-board, 4 * nxge on additional network card)
- 3 guest LDOMs and one I/O+control domain (both roles combined in the primary domain)
- the guest LDOMs use SAN disks over the 4 HBAs
- the primary domain uses a mirrored zpool on SSDs
- 5 vswitches in the hypervisor
- 4 aggregates (aggr1 to aggr4, each with L2 policy), each one with one igb and one nxge NIC
- each aggregate is connected to a separate vswitch (the 5th vswitch is for machine-internal communication)
- each guest LDOM has three vnets, each vnet connected to a vswitch (1 guest LDOM has aggr1+2 only for zones (via vnets), 2 guest LDOMs have aggr3+4 only for zones (via vnets), all LDOMs have aggr2+3 (via vnets) for global-zone communication, and all LDOMs are additionally connected to the machine-internal-only vswitch via the 3rd vnet)
- the primary domain uses 2 vnets connected to the vswitches which are connected to aggr2 and aggr3 (for consistency with the other LDOMs on this machine) and has no zones
- this means each entity (primary domain, guest LDOMs and each zone) has two vnets, and those two vnets are configured in a link-based IPMP setup (vnet-linkprop=phys-state)
- each vnet has VLAN tagging configured in the hypervisor (with the zones being in different VLANs than the LDOMs)
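To make the layout above more concrete, here is a sketch of how one aggregate/vswitch/vnet chain of this kind might be built (all names, keys and VLAN IDs are hypothetical; assumes the key-based dladm syntax of Solaris 10 and LDoms 1.3 or newer for linkprop):

  # one igb port and one nxge port aggregated with L2 policy (key 2 -> aggr2)
  dladm create-aggr -P L2 -d igb1 -d nxge1 2
  # a vswitch on top of the aggregate, serviced by the primary domain
  ldm add-vsw net-dev=aggr2 primary-vsw2 primary
  # a VLAN-tagged vnet for a guest LDOM which propagates the physical link state
  ldm add-vnet pvid=10 vid=20,30 linkprop=phys-state vnet0 primary-vsw2 ldom1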
The change proposed by Oracle is to replace the 2 vnet interfaces in the primary domain with 2 vsw interfaces (which means doing the VLAN tagging directly in the primary domain instead of in the vnet config). To keep IPMP working, this requires vsw-linkprop=phys-state. We have two systems with the same setup; on one of them we already made this change, and it works as before. As we don't know how to reproduce the first problem, we don't know whether it is fixed, or what the probability is of being hit by it again.
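On the system we already converted, the change looked roughly like this (a sketch with hypothetical names and addresses; assumes Solaris 10, where a VLAN-tagged interface is created via the device naming scheme VLAN-ID * 1000 + instance):

  # remove the two vnets from the primary domain (names hypothetical)
  ldm remove-vnet vnet0 primary
  ldm remove-vnet vnet1 primary
  # let the vsw devices report the physical link state so link-based IPMP keeps working
  ldm set-vsw linkprop=phys-state primary-vsw2
  ldm set-vsw linkprop=phys-state primary-vsw3
  # plumb the vsw devices with the VLAN tagging done in the primary domain:
  # vsw123002 = VLAN 123 on vsw instance 2 (123 * 1000 + 2)
  ifconfig vsw123002 plumb 192.0.2.10 netmask 255.255.255.0 group ipmp0 up
  ifconfig vsw123003 plumb group ipmp0 standby up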
Ideas / suggestions / info welcome.
Recently I started to reverse engineer a roughly 10 year old Java program (that means it was written at about the same time I touched Java for the first and last time at university; not because of a dislike of Java, but because other programming languages were more suitable for the problems at hand). Actually I am just reverse engineering the GUI applet (the frontend) of a service. The vendor has not existed for about 10 years, the program was not taken over by anyone else, and the system it is used from needs to be updated. The problem: it runs with JRE 1.3. With Java 5 we do not get error messages, but it does not work as it is supposed to. With Java 6 we get a popup about some values being NULL or 0.
So, first step: decompile all classes of the applet. Second step: compile the result for JRE 1.3 and test if it still works. Third step: modify it to run with Java 6 or 7. Fourth step: be happy.
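For the second step, the decompiled sources have to be cross-compiled for the old runtime. A sketch (paths are hypothetical) that also keeps javac from silently linking against the newer class library:

  # -source/-target pin the language and bytecode level to 1.3;
  # -bootclasspath points at the old rt.jar so newer APIs fail at compile time
  javac -source 1.3 -target 1.3 \
        -bootclasspath /opt/jre1.3/lib/rt.jar \
        -d build $(find src -name '*.java')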
Well, after decompiling all classes I now have about 1450 source files (~1100 Java source code files; the rest are pictures, properties files and maybe other stuff). From initially more than 4000 compile errors I am down to about 600. And those are only the compile errors; bugs in the code (put there either by the decompiler or by the programmers who wrote this software) are still to be detected. Unfortunately I don't know if I can compile just a subset of all classes for Java 6/7 and let the rest be compiled for Java 1.3, but I have a test environment where I can play around.
Plan B (searching for a replacement for the application) is already in progress in parallel. We will see which solution is faster.
I googled a lot regarding the error message "password is not set" when testing a datasource in WebSphere, but I did not find a solution. A co-worker finally found one (by accident?).
While the application JVMs were running, I created a new JAAS-J2C authentication entry (in my case the same login but a different password) and changed the datasource to use the new entry. I saved the config and synchronized it. The files config/cells/cellname/nodes/nodename/resources.xml and config/cells/cellname/security.xml showed that the changes had arrived on the node. Testing the datasource connectivity then failed with:
DSRA8201W: DataSource Configuration: DSRA8040I: Failed to connect to the DataSource. Encountered java.sql.SQLException: The application server rejected the connection. (Password is not set.) DSRA0010E: SQL State = 08004, Error Code = -99,999.
Restarting the application JVMs does not help.
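Whether a change really arrived on a node can be checked directly on the filesystem. A sketch (placeholder cell/node names; the attribute/element names are how I understand the WebSphere config files):

  # the datasource in the node's resources.xml should reference the new alias
  grep authDataAlias config/cells/cellname/nodes/nodename/resources.xml
  # the J2C login entries themselves live in the cell's security.xml
  grep authDataEntries config/cells/cellname/security.xml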
After stopping everything (application JVMs, nodeagent and deployment manager) and starting everything again, the connection test of the datasource works directly as expected.
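The sequence that worked looked like this (a sketch; profile paths and server names are placeholders):

  # stop everything: application JVMs first, then the nodeagent, then the dmgr
  /opt/WebSphere/profiles/AppSrv01/bin/stopServer.sh server1
  /opt/WebSphere/profiles/AppSrv01/bin/stopNode.sh
  /opt/WebSphere/profiles/Dmgr01/bin/stopManager.sh
  # start everything again in reverse order
  /opt/WebSphere/profiles/Dmgr01/bin/startManager.sh
  /opt/WebSphere/profiles/AppSrv01/bin/startNode.sh
  /opt/WebSphere/profiles/AppSrv01/bin/startServer.sh server1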
I have not tested whether it is enough to just stop all application JVMs on one node and the corresponding nodeagent, or if I really have to stop the deployment manager too.
Recently we had a strange performance problem at work. A web application was having slow response times from time to time, and users complained. We did not see any uncommon CPU/mem/swap usage on any involved machine. I generated heat maps from performance measurements, and there were no obvious traces of slow behavior. We did not find any reason why the application should be slow for clients, but obviously it was.
Then someone mentioned two recent Apache DoS problems. Number one, the cookie hash issue, did not seem to be the cause; we did not see the huge CPU or memory consumption we would expect with such an attack. The second one, the slow-read problem (no max connection duration timeout in Apache; exploitable via a small TCP receive window), looked like it could be an issue. The slow-read DoS problem can be detected by looking at the server-status page.
What you would see on the server-status page is a lot of worker threads in the 'W' (write data) state, which is supposed to be an indication of slow reads. We did see this.
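A quick way to spot this without reading the HTML table is the machine-readable output of mod_status. A sketch (assumes the status page is reachable on localhost):

  # count the workers currently in 'W' on the scoreboard line;
  # gsub returns the number of substitutions, i.e. the number of 'W' slots
  curl -s 'http://localhost/server-status?auto' |
    awk -F': ' '/^Scoreboard:/ { print gsub(/W/, "", $2), "workers in W state" }'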
As our site is behind a reverse proxy with some kind of IDS/IPS feature, we took the reverse proxy out of the picture to get a better view of who was doing what (we do not have X-Forwarded-For configured).
At this point we still noticed a lot of connections in the 'W' state from the rev-proxy. This was strange; it was not supposed to do this. After restarting the rev-proxy (while the clients went directly to the webservers), those 'W' entries were still in the server-status. This was getting really strange. And to add to this, the duration of the 'W' state from the rev-proxy showed that this state had been active for several thousand seconds. Ugh. WTF?
Ok, next step: killing the offenders. First I verified in the list of connections in the server-status (extended-status is activated) that all worker threads with the rev-proxy connection of a given PID were in this strange state and no client request was active. Then I killed that particular PID. I wanted to repeat this until none of those strange connections were left. Unfortunately I arrived at PIDs which were listed in the server-status (even after a refresh) but no longer existed in the OS. That is bad. Very bad.
So the next step was to move all clients away from one webserver, and then to reboot this webserver completely to be sure the entire system is in a known good state for future monitoring (the big hammer approach).
As we did not know whether this strange state was due to some kind of mis-administration of the system or not, we decided to put the rev-proxy in front of the webservers again and to monitor the systems.
We survived about one and a half days. After that, all worker threads on all webservers were in this state. DoS. At this point we were sure something malicious was going on (some days later our management showed us a mail from a company which had offered security consulting two months earlier, to make sure we would not get hit by a DDoS during the holiday season... a coincidence?).
Next step: checking for missing security patches (unfortunately it is not us who decide which patches are applied to the systems). We noticed that the rev-proxy was missing a patch for a DoS problem, and that a new fixpack for the webservers was scheduled for release in the near future (as of this writing, it is available).
Since we applied the DoS fix to the rev-proxy, we have not had the problem anymore. This is not really conclusive, as we do not know whether the fix solved the problem or the attacker simply stopped attacking us.
From reading what the DoS patch fixes, we would expect to see some continuous traffic between the rev-proxy and the webserver, but there was nothing when we observed the strange state.
We are still not allowed to apply patches the way we think we should, but at least we have better monitoring in place to watch out for this particular problem: activate the extended status in Apache/IHS, look for lines with state 'W' and a long duration (column 'SS'), and raise an alert if the duration is higher than the maximum possible/expected/desired duration for all possible URLs.
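A sketch of such a check (assumes ExtendedStatus On, a text browser to render the worker table, and the Apache 2.2 column layout with the mode in column 4 and 'SS' in column 6; verify the columns against your IHS version before relying on it):

  #!/bin/sh
  # alert when any worker has been in the 'W' state longer than MAX seconds
  MAX=300
  lynx -dump 'http://localhost/server-status' |
    awk -v max="$MAX" '$4 == "W" && $6 ~ /^[0-9]+$/ && $6 + 0 > max {
        print "ALERT: PID " $2 " in W state for " $6 " seconds"
    }'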
I was fighting with the right way to add a recent Verisign certificate to a keystore for the IBM HTTP Server (IHS). I used the ikeyman utility on Solaris.
The problem indicator was the error message “SSL0208E: SSL Handshake Failed, Certificate validation error” in the SSL log of IHS.
The IBM websites were not really helpful in tracking down the problem (the missing piece). The Verisign instructions did not lead to a working solution either.
What was done before: the Verisign intermediate certificates were imported as "Signer Certificates", and the certificate for the webserver was imported within "Personal Certificates". Without the signer certificates, the personal certificate would not import due to a missing intermediate certificate (no valid trust chain).
What I did to resolve the problem:
- I removed all Verisign certificates.
- I added the Verisign Root Certificate and the Verisign Intermediate Certificate A as signer certificates (use the "Add" button). I also tried to add the Verisign Intermediate Certificate B, but ikeyman complained that some part of it was already there as part of the Intermediate Certificate A, so I skipped this part.
- Then I converted the server certificate and key to a PKCS12 file via "openssl pkcs12 -export -in server-cert.arm -out cert-for-ihs.p12 -inkey server-key.arm -name name_for_cert_in_ihs".
- After that I imported the cert-for-ihs.p12 as a "Personal Certificate". The import dialog offers 3 items to import. I selected "name_for_cert_in_ihs" and the one containing "cn=verisign class 3 public primary certification authority - g5" (when I selected the 3rd one too, it complained that a part of it was already imported under a different name).
With this modified keystore in place, I just had to select the certificate via “SSLServerCert name_for_cert_in_ihs” in the IHS config and the problem was fixed.
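In hindsight, verifying the trust chain with openssl before touching ikeyman would have saved some time. A sketch (the bundle file name is hypothetical; the .arm files are base64-encoded, so openssl reads them like PEM):

  # concatenate root and intermediate certificates into one bundle,
  # then verify the server certificate against it
  cat verisign-root.arm verisign-intermediate-a.arm > verisign-chain.pem
  openssl verify -CAfile verisign-chain.pem server-cert.arm
  # a correct chain prints: server-cert.arm: OK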