Strange performance problem with the IBM HTTP Server (modified Apache)

Recently we had a strange performance problem at work. A web application was having slow response times from time to time and users complained. We did not see any uncommon CPU/mem/swap usage on any of the machines involved. I generated heat-maps from the performance measurements and there were no obvious traces of slow behavior. We did not find any reason why the application should be slow for clients, but obviously it was.

Then someone mentioned two recent Apache DoS problems. Number one – the cookie hash issue – did not seem to be the cause, as we did not see the huge CPU or memory consumption we would expect with such an attack. The second one – the slow-read problem (there is no maximum connection duration timeout in Apache, and it can be exploited with a small TCP receive window) – looked like it could be an issue. The slow-read DoS problem can be detected by looking at the server-status page.
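To illustrate the mechanism (a minimal sketch, not a reproduction of the traffic we saw): a client that keeps its TCP receive window small – approximated here with a tiny SO_RCVBUF – and drains the response very slowly ties up one worker thread in the ‘W’ state for the whole time. Host, port and path are placeholders; only point this at a test server you own.

    import socket
    import time

    HOST, PORT, PATH = "test-server.example", 80, "/"  # placeholders

    # Shrink the receive buffer before connecting so the advertised
    # TCP window stays small for the lifetime of the connection.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 256)
    s.connect((HOST, PORT))
    s.sendall(f"GET {PATH} HTTP/1.1\r\nHost: {HOST}\r\n\r\n".encode())

    # Read the response a few bytes at a time with long pauses; the
    # server-side worker stays in 'W' (write data) until we finish
    # or some timeout on the server kicks in.
    while True:
        chunk = s.recv(8)
        if not chunk:
            break
        time.sleep(5)
    s.close()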

What you would see on the server-status page are a lot of worker threads in the ‘W’ (write data) state. This is supposed to be an indication of slow reads. We did see this.
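A quick way to watch for this (a minimal sketch; the host is a placeholder and it assumes mod_status is enabled) is to fetch the machine-readable ?auto variant of the status page and count the ‘W’ slots in the scoreboard:

    from urllib.request import urlopen

    STATUS_URL = "http://webserver.example/server-status?auto"  # placeholder host

    text = urlopen(STATUS_URL).read().decode("utf-8", "replace")

    # The ?auto output contains a "Scoreboard:" line with one character per
    # worker slot: 'W' = sending reply, 'R' = reading request, '_' = idle, ...
    scoreboard = next(line.split(":", 1)[1].strip()
                      for line in text.splitlines()
                      if line.startswith("Scoreboard:"))

    print("workers in 'W' state:", scoreboard.count("W"),
          "of", len(scoreboard), "slots")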

As our site is behind a reverse proxy with some kind of IDS/IPS feature, we took the reverse proxy out of the picture to get a better view of who is doing what (we do not have X-Forwarded-For configured).

At this point we still noticed a lot of connections in the ‘W’ state from the rev-proxy. This was strange; it was not supposed to do this. After restarting the rev-proxy (while the clients went directly to the webservers) we still had those ‘W’ entries in the server-status. This was getting really strange. And to add to this, the duration of the ‘W’ state of the rev-proxy connections showed that this state had been active for several thousand seconds. Ugh. WTF?

Ok, next step: killing the offenders. First I verified in the list of connections in the server-status (extended-status is activated) that all worker threads with a rev-proxy connection of a given PID were in this strange state and that no client request was active. Then I killed this particular PID. I wanted to repeat this until I did not have those strange connections anymore. Unfortunately I arrived at PIDs which were listed in the server-status (even after a refresh), but which no longer existed in the OS. That is bad. Very bad.
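This is roughly what that check looked like (a sketch under assumptions: the host is a placeholder, ExtendedStatus is on, and the column positions in the worker table – here 1 for the PID and 3 for the mode ‘M’ – can differ between Apache/IHS versions, so verify them against your own status page):

    import re
    from collections import defaultdict
    from urllib.request import urlopen

    STATUS_URL = "http://webserver.example/server-status"  # placeholder host

    html = urlopen(STATUS_URL).read().decode("utf-8", "replace")

    # Group the mode ('M') of every worker slot by its PID.
    modes_by_pid = defaultdict(list)
    for row in re.findall(r"<tr>(.*?)</tr>", html, re.S):
        cells = [re.sub(r"<[^>]+>", "", c).strip()
                 for c in re.findall(r"<td[^>]*>(.*?)</td>", row, re.S)]
        if len(cells) > 3 and cells[1].isdigit():
            modes_by_pid[int(cells[1])].append(cells[3])

    # A PID is only a candidate for killing if every one of its worker
    # slots hangs in 'W', i.e. no real client request would be interrupted.
    for pid, modes in sorted(modes_by_pid.items()):
        if modes and all(m == "W" for m in modes):
            print(f"PID {pid}: all {len(modes)} workers in 'W' - kill candidate")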

So the next step was to move all clients away from one webserver, and then to reboot this webserver completely to be sure the entire system is in a known good state for future monitoring (the big hammer approach).

As we did not know whether this strange state was due to some kind of mis-administration of the system or not, we decided to put the rev-proxy in front of the webserver again and to monitor the systems.

We survived about one and a half days. After that all worker threads on all webservers were in this state. DoS. At this point we were sure there was something malicious going on (some days later our management showed us a mail from a company which had offered security consulting two months earlier to make sure we would not get hit by a DDoS during the holiday season… a coincidence?).

Next step: verification of missing security patches (unfortunately it is not us who decide which patches get applied to the systems). What we noticed is that the rev-proxy was missing a patch for a DoS problem, and that for the webservers a new fixpack was scheduled to be released in the near future (as of this writing it is available).

Since we applied the DoS fix for the rev-proxy, we have not had the problem anymore. This is not really conclusive, as we do not really know whether this fixed the problem or whether the attacker simply stopped attacking us.

From reading what the DoS patch fixes, we would assume we should see some continuous traffic going on between the rev-proxy and the webserver, but there was nothing of the kind when we observed the strange state.

We are still not allowed to apply patches the way we think we should, but at least we have better monitoring in place to watch out for this particular problem: activate the extended status in Apache/IHS, look for lines with state ‘W’ and a long duration (column ‘SS’), and raise an alert if the duration is higher than the maximum possible/expected/desired duration for all possible URLs.
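A minimal sketch of such a check (host, threshold and column positions are assumptions; it relies on ExtendedStatus being enabled and on the ‘SS’ column counting seconds since the current request started):

    import re
    import sys
    from urllib.request import urlopen

    STATUS_URL = "http://webserver.example/server-status"  # placeholder host
    MAX_SECONDS = 300  # longest duration any legitimate request should need

    html = urlopen(STATUS_URL).read().decode("utf-8", "replace")

    suspicious = []
    for row in re.findall(r"<tr>(.*?)</tr>", html, re.S):
        cells = [re.sub(r"<[^>]+>", "", c).strip()
                 for c in re.findall(r"<td[^>]*>(.*?)</td>", row, re.S)]
        # Assumed column order (verify against your IHS version):
        # 0 Srv, 1 PID, 2 Acc, 3 M, 4 CPU, 5 SS, ...
        if len(cells) > 5 and cells[3] == "W" and cells[5].isdigit():
            if int(cells[5]) > MAX_SECONDS:
                suspicious.append((cells[1], int(cells[5])))

    if suspicious:
        for pid, ss in suspicious:
            print(f"ALERT: worker of PID {pid} in 'W' state for {ss}s")
        sys.exit(2)  # e.g. CRITICAL for a Nagios-style check
    print("OK: no long-running 'W' workers")

Hooked into the existing monitoring, this turns a vague “the site feels slow” into a concrete alert long before all worker threads are exhausted.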