The problem I see at work: a T4-2 with 3 guest LDOMs, virtualized disks and networks lost all network connectivity “out of the blue” once, and it seems to have sporadic connectivity problems directly after a cold boot. After a lot of discussion with Oracle, I have the impression that we have two problems here.
The first problem: total network loss of the machine (no zone, no guest LDOM and not even the primary LDOM was able to receive or send IP packets). This happened once. No idea how to reproduce it. In the logs we see the message “[ID 920994 kern.warning] WARNING: vnetX: exceeded number of permitted handshake attempts (5) on channel xxx”. According to Oracle this is supposed to be fixed in patch 148677-01, which will come with Solaris 10u11. They suggested using a vsw interface instead of a vnet interface on the primary domain to at least lower the probability of this problem hitting us. They were not able to tell us how to reproduce the problem (it seems to be a race condition, at least that is my impression from the description of the Oracle engineer handling the SR). Only a reboot solved the problem. I was told we are the only client which reported this kind of problem; the patch is based upon an internal bug report from internal tests.
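To notice a repeat of this warning early, the logs can be scanned for it. The following is only a minimal sketch: the pattern is built from the message quoted above, while the function name `find_handshake_warnings` and the assumption that device names look like `vnet0`, `vnet1`, ... are mine.

```python
import re

# Built from the warning text quoted above. The layout of the rest of the
# syslog line (timestamp, host name) is not assumed, so we only search for
# the message itself.
HANDSHAKE_RE = re.compile(
    r"WARNING: (?P<dev>vnet\d+): exceeded number of permitted "
    r"handshake attempts \((?P<limit>\d+)\) on channel (?P<channel>\S+)"
)


def find_handshake_warnings(lines):
    """Return a (device, channel) tuple for every line with the warning."""
    hits = []
    for line in lines:
        match = HANDSHAKE_RE.search(line)
        if match:
            hits.append((match.group("dev"), match.group("channel")))
    return hits
```

On Solaris the kernel messages normally end up in /var/adm/messages, so something like `find_handshake_warnings(open("/var/adm/messages", errors="replace"))` would be the obvious way to feed it.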
The second problem: after cold boots, sometimes some machines (not all) are not able to connect to an IP on the T4. A reboot helps, as does removing an interface from an aggregate and directly adding it again (see below for the system config). To try to reproduce the problem, we did a lot of warm reboots of the primary domain, and the problem never showed up. We did some cold reboots, and the problem showed up once.
In case someone else sees one of those problems on their machines too, please get in contact with me, so that we can see what we have in common, try to track this down further, and share info which may help in reproducing the problems.
- T4-2 with 4 HBAs and 8 NICs (4 * igb on-board, 4 * nxge on additional network card)
- 3 guest LDOMs and one I/O + control domain (both roles in the primary domain)
- the guest LDOMs use SAN disks over the 4 HBAs
- the primary domain uses a mirrored zpool on SSDs
- 5 vswitches in the hypervisor
- 4 aggregates (aggr1 to aggr4 with L2-policy), each one with one igb and one nxge NIC
- each aggregate is connected to a separate vswitch (the 5th vswitch is for machine-internal communication)
- each guest LDOM has three vnets, each vnet connected to a vswitch (1 guest LDOM has aggr1+2 only for zones (via vnets), 2 guest LDOMs have aggr3+4 only for zones (via vnets), and all LDOMs have aggr2+3 (via vnets) for global-zone communication, all LDOMs are additionally connected to the machine-internal-only vswitch via the 3rd vnet)
- primary domain uses 2 vnets connected to the vswitches which are connected to aggr2 and aggr3 (consistency with the other LDOMs on this machine) and has no zones
- this means each entity (primary domain, guest LDOMs and each zone) has two vnets, and those two vnets are configured in a link-based IPMP setup (vnet-linkprop=phys-state)
- each vnet has VLAN tagging configured in the hypervisor (with the zones being in different VLANs than the LDOMs)
The change proposed by Oracle is to replace the 2 vnet interfaces in the primary domain with 2 vsw interfaces (which means doing the VLAN tagging in the primary domain directly instead of in the vnet config). To have IPMP working, this means setting vsw-linkprop=phys-state. We have two systems with the same setup; on one system we already changed this and it is working as before. As we don’t know how to reproduce the 1st problem, we don’t know if the problem is fixed or not, or what the probability is of getting hit by it again.
Ideas / suggestions / info welcome.
The geosmart plugin is incompatible with the one-time-password (OTP) plugin of WordPress. The problem is that the OTP plugin does not display the challenge on the login page anymore when the geosmart plugin is activated.
A workaround may be to make sure the geosmart plugin does not do anything on the login page, but this incompatibility could also cause problems somewhere else.
The problem could be related to the way the geosmart plugin uses jQuery. I found a bug report for OTP where the problem was the jQuery handling in another plugin. The specific problem mentioned there does not seem to be the same as in the geosmart plugin, at least on the very quick look I had.
So… for now I disabled the geosmart plugin. While it was active and the challenge was not displayed, I guessed the sequence number right most of the time, but sometimes I did not.
As previously reported, I tried the update to Android 3.2 on my Tab and was not happy about the new EMail app. At the weekend I had a little bit of time, so I tried to get the Email.apk from Android 3.1 into Android 3.2.
Long story short, I failed.
TitaniumBackup PRO was restoring for hours (the option to migrate from a different ROM version was enabled) until I killed the app, and it did not get anywhere (I just emailed their support to ask if I did something completely stupid, or if this is a bug in TB). And a copy by hand into /system/apps did not work (app fails to start).
Recently we had a strange performance problem at work. A web application was having slow response times from time to time and users complained. We did not see any unusual CPU/mem/swap usage on any involved machine. I generated heat-maps from performance measurements and there were no obvious traces of slow behavior. We did not find any reason why the application should be slow for clients, but obviously it was.
Then someone mentioned two recent Apache DoS problems. Number one (the cookie hash issue) did not seem to be the cause: we did not see the huge CPU or memory consumption which we would expect to see with such an attack. The second one (the slow reads problem: there is no max connection duration timeout in Apache, which can be exploited with a small receive window for TCP) looked like it could be an issue. The slow read DoS problem can be detected by looking at the server-status page.
What you would see on the server-status page are a lot of worker threads in the ‘W’ (write data) state. This is supposed to be an indication of slow reads. We did see this.
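A quick way to put a number on "a lot of worker threads in the ‘W’ state" is the machine-readable variant of the status page (`server-status?auto`), which contains a `Scoreboard:` line with one character per worker slot. The following is a small sketch that works on the already fetched text; the function names are made up for this example, and fetching the page is left out.

```python
from collections import Counter


def scoreboard_counts(auto_status_text):
    """Count worker states from the 'Scoreboard:' line of '?auto' output."""
    for line in auto_status_text.splitlines():
        if line.startswith("Scoreboard:"):
            return Counter(line.split(":", 1)[1].strip())
    return Counter()


def write_state_ratio(auto_status_text):
    """Fraction of allocated worker slots that are in the 'W' state."""
    counts = scoreboard_counts(auto_status_text)
    # '.' marks an open slot without a process, so it is not counted.
    allocated = sum(n for state, n in counts.items() if state != ".")
    return counts["W"] / allocated if allocated else 0.0
```

What ratio counts as suspicious depends on the application; a site that sends large responses to slow clients will legitimately have many workers in ‘W’.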
As our site is behind a reverse proxy with some kind of IDS/IPS feature, we took the reverse proxy out of the picture to get a better view of who is doing what (we do not have X-Forwarded-For configured).
At this point we still noticed a lot of connections in the ‘W’ state from the rev-proxy. This was strange, it was not supposed to do this. After restarting the rev-proxy (while the clients went directly to the webservers) we still had those ‘W’ entries in the server-status. This was getting really strange. And to add to this, the duration of the ‘W’ state from the rev-proxy showed that this state had been active for several thousand seconds. Ugh. WTF?
Ok, next step: killing the offenders. First I verified in the list of connections in the server-status (extended-status is activated) that all worker threads with the rev-proxy connection of a given PID were in this strange state and no client request was active. Then I killed this particular PID. I wanted to repeat this until no such strange connections were left. Unfortunately I arrived at PIDs which were listed in the server-status (even after a refresh), but did not exist in the OS. That is bad. Very bad.
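The check "listed in the server-status, but not available in the OS" can be scripted. This is a sketch for POSIX systems only; the function names are invented here, and the PIDs have to be taken from the server-status page by other means.

```python
import errno
import os


def pid_exists(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0: only the existence/permission check is done
    except OSError as exc:
        # EPERM: the process exists but belongs to another user.
        return exc.errno == errno.EPERM
    return True


def stale_pids(pids):
    """Return the PIDs from the status page that the OS does not know."""
    return sorted(p for p in set(pids) if not pid_exists(p))
```

Note that a PID may have been reused by an unrelated process in the meantime, so an existing PID is not proof that the scoreboard entry is valid.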
So the next step was to move all clients away from one webserver, and then to reboot this webserver completely to be sure the entire system is in a known good state for future monitoring (the big hammer approach).
As we did not know if this strange state was due to some kind of mis-administration of the system or not, we decided to have the rev-proxy again in front of the webserver and to monitor the systems.
We survived about one and a half days. After that all worker threads on all webservers were in this state. DoS. At this point we were sure there was something malicious going on (some days later our management showed us a mail from a company which had offered security consulting 2 months before, to make sure we do not get hit by a DDoS during the holiday season… a coincidence?).
Next step: verification of missing security patches (unfortunately it is not us who decides which patches we apply to the systems). We noticed that the rev-proxy was missing a patch for a DoS problem, and that for the webservers a new fixpack was scheduled to be released in the near future (as of this writing: it is available now).
Since we applied the DoS fix for the rev-proxy, we do not have a problem anymore. This is not really conclusive, as we do not really know if this fixed the problem or if the attacker stopped attacking us.
From reading what the DoS patch fixes, we would assume we should see some continuous traffic going on between the rev-proxy and the webserver, but there was nothing when we observed the strange state.
We are still not allowed to apply patches as we think we should do, but at least we have a better monitoring in place to watch out for this particular problem (activate the extended status in apache/IHS, look for lines with state ‘W’ and a long duration (column ‘SS’), raise an alert if the duration is higher than the max. possible/expected/desired duration for all possible URLs).
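The monitoring rule described above can be sketched roughly like this. It is not our actual monitoring, just an illustration: it assumes the extended server-status HTML contains a table whose header row has the columns `PID`, `M` and `SS` (the exact set of columns differs between versions, which is why they are looked up by name), and the names `TableRows` and `long_writes` are invented for the example.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect every table row as a list of cell texts."""

    def __init__(self):
        super().__init__()
        self.rows, self.row, self.cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.flush_row()
            self.row = []
        elif tag in ("td", "th") and self.row is not None:
            self.flush_cell()
            self.cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.flush_cell()
        elif tag in ("tr", "table"):
            self.flush_row()

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def flush_cell(self):
        if self.cell is not None and self.row is not None:
            self.row.append("".join(self.cell).strip())
        self.cell = None

    def flush_row(self):
        self.flush_cell()
        if self.row:
            self.rows.append(self.row)
        self.row = None


def long_writes(status_html, max_seconds):
    """Return the rows in state 'W' whose 'SS' value exceeds max_seconds."""
    parser = TableRows()
    parser.feed(status_html)
    parser.close()
    parser.flush_row()  # in case the last row was never closed
    header, hits = None, []
    for row in parser.rows:
        if "PID" in row and "M" in row and "SS" in row:
            header = row
            continue
        if header is None or len(row) != len(header):
            continue  # e.g. rows of the legend table
        rec = dict(zip(header, row))
        if rec["M"] == "W" and rec["SS"].isdigit() and int(rec["SS"]) > max_seconds:
            hits.append(rec)
    return hits
```

An alert would then be raised whenever `long_writes(page, limit)` returns a non-empty list, with `limit` being the maximum expected duration over all URLs.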
The recent Phoronix benchmark which compared a release candidate of FreeBSD 9 with Oracle Linux Server 6.1 created a huge discussion on the FreeBSD mailing lists. The reason was that some people think the numbers presented there give a wrong picture of FreeBSD. Partly this is because not all benchmark numbers are presented on the most prominent page (as linked above), but only at a different place. This gives the impression that FreeBSD is inferior in this benchmark, while it just puts the focus (for a reason, according to some people) on a different part of the benchmark. To be more specific: blogbench does disk reads and writes in parallel, and FreeBSD gives higher priority to writes than to reads. FreeBSD 9 outperforms OLS 6.1 in the writes while OLS 6.1 shines with the reads, and only the reads are presented on the first page. Another complaint is that the article states that the default install was used (in this case UFS as the FS), when it was not (ZFS as the FS).
The author of the Phoronix article participated in parts of the discussion and asked for specific improvement suggestions. A FreeBSD committer seems to be already working to get some issues resolved. What I personally do not like is that the article has not been updated with a remark that some things presented do not reflect reality and that a retest is necessary.
As there was much talk in the thread but not much obvious activity from our side to resolve some issues, I started to improve the FreeBSD wiki page about benchmarking so that we are able to point to it in case someone wants to benchmark FreeBSD. Others already chimed in and improved some things too. It is far from perfect; some more eyes (and more importantly some more fingers which add content) are needed. Please go to the wiki page and try to help out (if you are afraid to write something in the wiki, please at least post your suggestions on a FreeBSD mailing list so that others can improve the wiki page).
What we need too is a wiki page about FreeBSD tuning (a first step would be to take the tuning(7) man page and convert it into a wiki page, then to improve it, and then to feed the changes back to the man page while keeping the wiki page, to be able to cross-reference parts from the benchmarking page).
I already said this in the thread about the Phoronix benchmark: everyone is welcome to improve the situation. Do not talk, write something. No matter if it is an improvement to the benchmarking page, tuning advice, or a tool which inspects the system and suggests some tuning. If you want to help in the wiki, create a FirstnameLastname account and ask a FreeBSD committer for write access.
A while ago (IIRC we have to think in months or even years) there was a framework for automatic FreeBSD benchmarking. Unfortunately the author ran out of time. The framework was able to install a FreeBSD system on a machine, run some specified benchmark (not many benchmarks were integrated), and then install another FreeBSD version to run the same benchmark, or reinstall the same version to run another benchmark. IIRC there was also some DB behind it which collected the results, and maybe there was even some way to compare them. It would be nice if someone could find some time to talk with the author to get the framework and set it up somewhere, so that we have a controlled environment where we can do our own benchmarks in an automatic and repeatable fashion with several FreeBSD versions.