Strange performance problem with the IBM HTTP Server (modified apache)

Recently we had a strange performance problem at work. A web application was having slow response times from time to time and users complained. We did not see any uncommon CPU/mem/swap usage on any involved machine. I generated heat-maps from performance measurements and there were no obvious traces of slow behavior. We did not find any reason why the application should be slow for clients, but obviously it was.

Then someone mentioned two recent apache DoS problems. Number one, the cookie hash issue, did not seem to be the cause: we did not see the huge CPU or memory consumption we would expect to see with such an attack. The second one, the slow-read problem (there is no maximum connection duration timeout in apache, and it can be exploited with a small TCP receive window), looked like it could be an issue. The slow-read DoS problem can be detected by looking at the server-status page.
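To illustrate the mechanism, here is a minimal sketch of what a slow-read client does (Python; the hostname and URL are placeholders): it advertises a tiny TCP receive window and then drains the response a few bytes at a time, so a single request can pin a worker thread for an arbitrarily long time.

    # Minimal sketch of the slow-read idea, for understanding only.
    # The hostname and URL are placeholders.
    import socket
    import time

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Shrink the receive buffer before connecting, so the kernel
    # advertises a small TCP window to the server.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 256)
    s.connect(("webserver.example.com", 80))
    s.sendall(b"GET /some/large/file HTTP/1.1\r\n"
              b"Host: webserver.example.com\r\n\r\n")
    while True:
        chunk = s.recv(16)      # read only a few bytes at a time...
        if not chunk:
            break
        time.sleep(10)          # ...with long pauses in between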

What you would see on the server-status page are a lot of worker threads in the ‘W’ (write data) state. This is supposed to be an indication of slow reads. We did see this.
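To put a number on it, the ‘W’ workers can be counted from the scoreboard. A small sketch, assuming the machine-readable ?auto variant of the status page is enabled and reachable (the URL is a placeholder):

    # Count worker threads in the 'W' (write data) state via mod_status.
    # Assumes ExtendedStatus is on and ?auto is reachable; URL is a placeholder.
    from urllib.request import urlopen

    status = urlopen("http://webserver.example.com/server-status?auto").read().decode()
    for line in status.splitlines():
        if line.startswith("Scoreboard:"):
            board = line.split(":", 1)[1].strip()
            print("workers in 'W' state:", board.count("W"))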

As our site is behind a reverse proxy with some kind of IDS/IPS feature, we took the reverse proxy out of the picture to get a better view of who is doing what (we do not have X-Forwarded-For configured).
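As an aside, making the real client IP visible again behind the rev-proxy would only need the X-Forwarded-For request header in the access log. A minimal httpd.conf sketch (an assumption, not our actual configuration):

    # Log the X-Forwarded-For header so client IPs stay visible
    # behind the reverse proxy (hypothetical log format name).
    LogFormat "%h %l %u %t \"%r\" %>s %b \"%{X-Forwarded-For}i\"" proxylog
    CustomLog logs/access_log proxylog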

At this point we still noticed a lot of connections in the ‘W’ state from the rev-proxy. This was strange, it was not supposed to do this. After restarting the rev-proxy (while the clients went directly to the webservers) those ‘W’ entries were still in the server-status. This was getting really strange. And to add to this, the duration of the ‘W’ state from the rev-proxy showed that this state had been active for several thousand seconds. Ugh. WTF?

Ok, next step: killing the offenders. First I verified in the list of connections in the server-status (extended-status is activated) that all worker threads with a rev-proxy connection for a given PID were in this strange state and no client request was active. Then I killed this particular PID. I wanted to repeat this until no such strange connections were left. Unfortunately I arrived at PIDs which were listed in the server-status (even after a refresh), but were not available in the OS. That is bad. Very bad.
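The cross-check against the OS is simple: signal 0 probes a PID without killing it. A sketch (the PID list is a placeholder for what you collect from the server-status page):

    # Check whether PIDs listed on the server-status page still exist
    # in the OS; signal 0 probes a process without killing it.
    import os

    status_pids = [12345, 12346, 12347]   # placeholder: PIDs from server-status

    for pid in status_pids:
        try:
            os.kill(pid, 0)               # does not kill, only checks existence
        except ProcessLookupError:
            print("PID", pid, "is in server-status but gone from the OS")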

So the next step was to move all clients away from one webserver, and then to reboot this webserver completely to be sure the entire system is in a known good state for future monitoring (the big hammer approach).

As we did not know if this strange state was due to some kind of mis-administration of the system or not, we decided to have the rev-proxy again in front of the webserver and to monitor the systems.

We survived about one and a half days. After that, all worker threads on all webservers were in this state. DoS. At this point we were sure there was something malicious going on (some days later our management showed us a mail from a company which had offered security consulting two months earlier, to make sure we do not get hit by a DDoS during the holiday season… a coincidence?).

Next step: verification of missing security patches (unfortunately it is not us who decides which patches we apply to the systems). We noticed that the rev-proxy was missing a patch for a DoS problem, and that a new fixpack for the webservers was scheduled to be released in the near future (as of this writing it is available).

Since we applied the DoS fix for the rev-proxy, we have not had the problem anymore. This is not really conclusive, as we do not really know if this fixed the problem or if the attacker simply stopped attacking us.

From reading what the DoS patch fixes, we would have expected to see some continuous traffic between the rev-proxy and the webserver, but there was nothing when we observed the strange state.

We are still not allowed to apply patches the way we think we should, but at least we have better monitoring in place to watch out for this particular problem: activate the extended status in apache/IHS, look for lines with state ‘W’ and a long duration (column ‘SS’), and raise an alert if the duration is higher than the maximum possible/expected/desired duration for all possible URLs.
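Sketched below is roughly what such a check could look like (the URL and threshold are placeholders, and the HTML parsing is deliberately naive and tied to the stock layout of the status page, where the fourth column is the state ‘M’ and the sixth is ‘SS’):

    # Flag workers that sit in state 'W' longer than any legitimate
    # request should take. URL and threshold are placeholders; the
    # parsing is naive and tied to the stock status page layout.
    import re
    from urllib.request import urlopen

    THRESHOLD = 300   # seconds; max. expected duration over all URLs
    html = urlopen("http://webserver.example.com/server-status").read().decode()

    for row in re.findall(r"<tr>(.*?)</tr>", html, re.S):
        cells = re.findall(r"<td[^>]*>(.*?)</td>", row, re.S)
        if len(cells) < 6:
            continue                      # header or malformed row
        mode, ss = cells[3].strip(), cells[5].strip()
        if mode.startswith("W") and ss.isdigit() and int(ss) > THRESHOLD:
            print("possible slow-read worker: PID", cells[1].strip(), "SS", ss)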
