Strange performance problem with the IBM HTTP Server (modified Apache)

Recently we had a strange performance problem at work. A web application had slow response times from time to time, and users complained. We did not see any unusual CPU/memory/swap usage on any of the machines involved. I generated heat maps from performance measurements, and there were no obvious traces of slow behavior. We did not find any reason why the application should be slow for clients, but obviously it was.

Then someone mentioned two recent Apache DoS problems. Number one, the cookie hash issue, did not seem to be the cause: we did not see the huge CPU or memory consumption we would expect from such an attack. The second one, the slow read problem (Apache has no timeout for the maximum duration of a connection, which can be exploited with a small TCP receive window), looked like it could be an issue. The slow read DoS problem can be detected by looking at the server-status page.

What you would see on the server-status page is a lot of worker threads in the ‘W’ (write data) state. This is supposed to be an indication of slow reads. We did see this.
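To put a number on "a lot of worker threads in the ‘W’ state", the scoreboard can be counted automatically. The following is a minimal Python sketch, assuming the status page is fetched in its machine-readable form (the `?auto` query string of mod_status), which contains a `Scoreboard:` line with one character per worker slot. The function names are made up for illustration, and fetching the page is left out.

```python
from collections import Counter


def scoreboard_counts(auto_status: str) -> Counter:
    """Count worker states from the text of a 'server-status?auto' page."""
    for line in auto_status.splitlines():
        if line.startswith("Scoreboard:"):
            # One character per slot, e.g. 'W' = sending reply, '_' = waiting,
            # '.' = open slot without a worker.
            return Counter(line.split(":", 1)[1].strip())
    raise ValueError("no Scoreboard line found")


def writing_fraction(auto_status: str) -> float:
    """Fraction of existing workers (open slots ignored) that are in 'W'."""
    counts = scoreboard_counts(auto_status)
    workers = sum(n for state, n in counts.items() if state != ".")
    return counts["W"] / workers if workers else 0.0
```

A value close to 1.0 over a longer period would match what we saw; what counts as "too high" depends on the site and is not something this sketch can decide.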

As our site is behind a reverse proxy with some kind of IDS/IPS feature, we took the reverse proxy out of the picture to get a better view of who was doing what (we do not have X-Forwarded-For configured, so with the proxy in front the webservers only see the proxy as the client).

At this point we still noticed a lot of connections from the rev-proxy in the ‘W’ state. This was strange; it was not supposed to do this. After restarting the rev-proxy (while the clients went directly to the webservers), we still had those ‘W’ entries in the server-status. This was getting really strange. And to add to this, the duration shown for the ‘W’ state of the rev-proxy connections told us that this state had been active for several thousand seconds. Ugh. WTF?

Ok, next step: killing the offenders. First I verified in the list of connections in the server-status (extended-status is activated) that all worker threads of a given PID with a rev-proxy connection were in this strange state and that no client request was active. Then I killed this particular PID. I wanted to repeat this until those strange connections were gone. Unfortunately I arrived at PIDs which were listed in the server-status (even after a refresh), but did not exist in the OS. That is bad. Very bad.
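The check "listed in the server-status, but not present in the OS" can be scripted. This is only a sketch in Python for Unix-like systems: it assumes the PIDs have already been collected from the status page, and the helper names are hypothetical. It relies on signal 0, which performs the existence and permission check without delivering a signal.

```python
import os


def pid_exists(pid: int) -> bool:
    """Return True if a process with this PID currently exists (Unix only)."""
    if pid <= 0:
        # 0 and negative values address process groups, not a single process.
        raise ValueError("pid must be positive")
    try:
        os.kill(pid, 0)  # signal 0: error checking only, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def stale_pids(status_pids, exists=pid_exists):
    """PIDs taken from the status page that the OS does not know about."""
    return sorted({pid for pid in status_pids if not exists(pid)})
```

A non-empty result from `stale_pids` is the situation described above: the scoreboard still holds entries for processes that are gone.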

So the next step was to move all clients away from one webserver and then to reboot this webserver completely, to be sure the entire system was in a known good state for future monitoring (the big hammer approach).

As we did not know whether this strange state was due to some kind of mis-administration of the system, we decided to put the rev-proxy in front of the webservers again and to monitor the systems.

We survived about one and a half days. After that, all worker threads on all webservers were in this state. DoS. At this point we were sure there was something malicious going on (some days later our management showed us a mail from a company which had offered security consulting two months earlier, to make sure we would not get hit by a DDoS during the holiday season… a coincidence?).

Next step: verification of missing security patches (unfortunately it is not us who decide which patches get applied to the systems). What we noticed is that the rev-proxy was missing a patch for a DoS problem, and that a new fixpack for the webservers was scheduled to be released in the near future (as of this writing, it is available).

Since we applied the DoS fix for the rev-proxy, we have not had the problem anymore. This is not really conclusive, as we do not know whether the patch fixed the problem or the attacker simply stopped attacking us.

From reading what the DoS patch fixes, we would assume that we should have seen some continuous traffic between the rev-proxy and the webserver, but there was nothing when we observed the strange state.

We are still not allowed to apply patches the way we think we should, but at least we have better monitoring in place to watch out for this particular problem: activate the extended status in Apache/IHS, look for lines with state ‘W’ and a long duration (column ‘SS’), and raise an alert if the duration is higher than the maximum possible/expected/desired duration for all possible URLs.
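The monitoring check can be sketched roughly as follows. This is not our actual monitoring script, just a minimal Python illustration using only the standard library. It assumes the extended status page is an HTML table whose header row contains the columns ‘M’ (state) and ‘SS’ (seconds since the beginning of the most recent request); the exact set of columns differs between Apache/IHS versions, so the columns are looked up by name. The function name and the threshold are hypothetical.

```python
from html.parser import HTMLParser


class _TableRows(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def _end_cell(self):
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def flush(self):
        self._end_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.flush()  # tolerate rows without a closing tag
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._end_cell()
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag in ("tr", "table"):
            self.flush()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def stuck_writers(status_html: str, max_seconds: int):
    """Return the rows in state 'W' whose 'SS' value exceeds max_seconds."""
    parser = _TableRows()
    parser.feed(status_html)
    parser.close()
    parser.flush()
    header, hits = None, []
    for row in parser.rows:
        if "M" in row and "SS" in row:  # header row of the worker table
            header = row
            continue
        if header is None or len(row) != len(header):
            continue  # legend tables and other rows
        rec = dict(zip(header, row))
        if rec["M"] == "W" and rec["SS"].isdigit() and int(rec["SS"]) > max_seconds:
            hits.append(rec)
    return hits
```

An alert would be raised whenever `stuck_writers(page, limit)` returns a non-empty list, with `limit` set to the longest duration any URL of the application is expected to need.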
