Recently we had a strange performance problem at work. A web application was having slow response times from time to time and users complained. We did not see an uncommon CPU/mem/swap usage on any involved machine. I generated heat-maps from performance measurements and there where no obvious traces of slow behavior. We did not find any reason why the application should be slow for clients, but obviously it was.
Then someone mentioned two recent apache DoS problems. Number one — the cookie hash issue — did not seem to be the cause, we did not see a huge CPU or memory consumption which we would expect to see with such an attack. The second one — the slow reads problem (no max connection duration timeout in apache, can be exploited by a small receive window for TCP) — looked like it could be an issue. The slow read DoS problem can be detected by looking at the server-status page.
What you would see on the server-status page are a lot of worker threads in the ‘W’ (write data) state. This is supposed to be an indication of slow reads. We did see this.
As our site is behind a reverse proxy with some kind of IDS/IPS feature, we took the reverse proxy out of the picture to get a better view of who is doing what (we do not have X-Forwarded-For configured).
At this point we noticed still a lot of connection in the ‘W’ state from the rev-proxy. This was strange, it was not supposed to do this. After restarting the rev-proxy (while the clients went directly to the webservers) we had those ‘W’ entries still in the server-status. This was getting really strange. And to add to this, the duration of the ‘W’ state from the rev-proxy tells that this state is active since several thousand seconds. Ugh. WTF?
Ok, next step: killing the offenders. First I verified in the list of connections in the server-status (extended-status is activated) that all worker threads with the rev-proxy connection of a given PID are in this strange state and no client request is active. Then I killed this particular PID. I wanted to do this until I do not have those strange connections anymore. Unfortunately I arrived at PIDs which were listed in the server-status (even after a refresh), but not available in the OS. That is bad. Very bad.
So the next step was to move all clients away from one webserver, and then to reboot this webserver completely to be sure the entire system is in a known good state for future monitoring (the big hammer approach).
As we did not know if this strange state was due to some kind of mis-administration of the system or not, we decided to have the rev-proxy again in front of the webserver and to monitor the systems.
We survived about one and a half day. After that all worker threads on all webservers where in this state. DoS. At this point we where sure there was something malicious going on (some days later our management showed us a mail from a company which offered security consulting 2 months before to make sure we do not get hit by a DDoS during the holiday season… a coincidence?).
Next step, verification of missing security patches (unfortunately it is not us who decides which patches we apply to the systems). What we noticed is, that the rev-proxy is missing a patch for a DoS problem, and for the webservers a new fixpack was scheduled to be released not far in the future (as of this writing: it is available now).
Since we applied the DoS fix for the rev-proxy, we do not have a problem anymore. This is not really conclusive, as we do not really know if this fixed the problem or if the attacker stopped attacking us.
From reading what the DoS patch fixes, we would assume we should see some continuous traffic going on between the rev-rpoxy and the webserver, but there was nothing when we observed the strange state.
We are still not allowed to apply patches as we think we should do, but at least we have a better monitoring in place to watch out for this particular problem (activate the extended status in apache/IHS, look for lines with state ‘W’ and a long duration (column ‘SS’), raise an alert if the duration is higher than the max. possible/expected/desired duration for all possible URLs).
Tags: dos problem, dos problems, memory consumption, performance measurements, performance problem, proxy connection, reverse proxy, slow response times, swap usage, worker threads —