Saturday, May 4, 2013

What has been going on.

TL;DR: We've had trouble with the memory consumption of Apache processes on our Evergreen server since our latest update on April 14, 2013. Along the way to figuring this out we've had a few minor detours and fixed another bug. Our breakthrough came when one of us realized that the longer an Apache process had been running, the more memory it was using. We have made changes to our Apache configuration as a mitigation strategy. Basically, we have lowered our MaxRequestsPerChild from 10,000 down to 1,000. This appears to have helped, but only time will tell.

Read on for the gory details....

Since we updated our Evergreen installation on Sunday, April 14, 2013, we've had trouble with what now appear to be memory leaks on the server, specifically involving Apache or some module that we load into it, perhaps Evergreen mod_perl code.

The first time we noticed that we had a problem was when memcached crashed on Tuesday the 16th. We restarted the memcached service, and it crashed again in a day or two. This time, we found a message in the logs indicating that the server's operating system had shut memcached down because the operating system was out of memory. We moved the memcached service to another server, and all seemed well for a few more days.
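On Linux, that kind of forced shutdown is usually the kernel's out-of-memory (OOM) killer at work, and it leaves a trail in the kernel log. Something along these lines will usually turn it up (the exact log file name varies by distribution, so treat this as a sketch rather than the commands we ran):

    # look for OOM killer activity in the kernel ring buffer
    dmesg | grep -i -E 'out of memory|killed process'

    # or in the system log (may be /var/log/messages on some distributions)
    grep -i 'oom' /var/log/syslog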

As the following work week began, we got calls about problems with Evergreen. We checked the server and it seemed busy. We also found a message in the operating system logs warning of a possible SYN flood attack on port 80, the port used by the web server for unencrypted HTTP traffic. In addition, we counted a high number of connections to the server on this and related ports.
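The counting itself is nothing fancy; a quick one-liner like the following gives a rough count of the TCP connections touching port 80 (ss works just as well as netstat on newer systems):

    # rough count of TCP connections involving port 80
    netstat -ant | grep -c ':80 '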

We restarted the web server software, and the situation immediately improved. We kept an eye on the server for the rest of the week and restarted services as necessary. At this point, it looked like an attack from outside the network, but we never saw that message about the SYN flood in the logs again. Perhaps the attack, or the appearance of an attack, was a one-time thing. In the end it turned out not to be the cause of our troubles.

The problems occurred again on the morning of Monday, the 29th. This time there were several OpenSRF Listener processes each using 100% of a CPU. (On a multicore system, several processes can each peg a core at 100%.) We restarted Evergreen and related services on the server, and the situation immediately improved. We kept an eye on the number of connections and the processes on the server for the rest of the day, and everything seemed all right from that point of view.
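Spotting runaway processes like that is easy enough in top, or with a quick snapshot along these lines (a sketch; adjust the column list to taste):

    # show the top CPU consumers with their accumulated CPU time
    ps -eo pid,pcpu,cputime,comm --sort=-pcpu | head -n 15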

Our problem struck again on Tuesday morning just before 9:00 am. This time, we were watching as it happened and let it unfold. As the connection count skyrocketed, we took a look at where the connections were coming from and what state they were in. What we saw turned out to look like the opposite of a SYN flood. The majority of the HTTP connections were from the server to itself, and they were in the CLOSE_WAIT state. CLOSE_WAIT means the remote end has closed the connection but the local application has never closed its side of the socket, so a large number of connections in this state often points to a problem in the application, particularly when your server is talking to itself.
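The breakdown came from the usual netstat incantations, something like the following (ss -tan gives equivalent output on newer systems):

    # break down the TCP connections by state
    netstat -ant | awk 'NR > 2 {print $6}' | sort | uniq -c | sort -rn

    # see which remote addresses the CLOSE_WAIT connections point at
    netstat -ant | awk '$6 == "CLOSE_WAIT" {print $5}' | sort | uniq -c | sort -rn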

We spent the rest of Tuesday tracking down the source of these connections in the Evergreen source code. It turns out that when Evergreen goes to request a book jacket image, table of contents, or other added content, it first makes a connection to itself at a special URL to run the code that does the retrieval from the added content provider, Syndetics in our case. This extra step makes it easy for the Evergreen server to cache the information for popular titles, so it only has to ask Syndetics for the covers, etc., once rather than every time a patron brings up a record. After the information is retrieved from the cache or over the network from the added content provider, the content is read from the connection to the local server and served up via the TPAC. These connections were not being reused; a new one was created for every added content request. They were also not being shut down, so they were just left dangling, kind of like a phone left off the hook.

"Ah ha!" We thought we had found our smoking gun. These dangling connections must be responsible for the server's memory exhaustion. We figured that they had to be keeping the Apache processes running longer than necessary, and the connections themselves were likely eating up memory for buffers and the other things they need to operate. We made a code branch that shut the connections down once the data had been retrieved. We put that into production on Tuesday night around 9:30 and expected things to be fixed on Wednesday.

Well, things appeared to be fixed on Wednesday, or at least improved. As far as the CLOSE_WAIT connections were concerned, we had orders of magnitude fewer: instead of thousands, we had about 70. We took another look at the code, though, and realized that we could get rid of even more by moving where the shutdown happened.

As it turned out, we were missing those connections that didn't return any added content because there was none to retrieve. We made that change and restarted Apache. This got us down to only around 10 (and often fewer) of those broken connections.

Not too long after that, we got some reports of new errors with Evergreen. We took a closer look and realized we needed a "fix for the fix to the fix." We put that in place, restarted Apache again, and it was smooth sailing for the rest of the day.

As always, we shared these fixes with the community, and they have been added into the master branch and are in the 2.4.0 release.

We found out just before 9:00 am on Thursday morning, when the server ran out of memory and all the swap was in use, that fixing the problem with open connections had not solved our actual problem. After rebooting the Evergreen server, we set about monitoring processes again, with an eye out for anything interesting that turned up.

It turns out that sorting the processes in top by the amount of memory they were using revealed a very interesting feature of our running Evergreen system: an almost linear correlation between how much CPU time an Apache process had accumulated (i.e. how long it had been running) and the amount of memory it consumed. The highest consumers of memory on the system were the Apache processes that had been running the longest. This explained why restarting the Apache processes on the server would free up so much memory, and it gave us a hint at a mitigation strategy that would work.
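Pressing Shift+M inside top sorts by memory interactively; for a snapshot that also shows the accumulated CPU time alongside memory use, something like this works (again, just a sketch):

    # processes sorted by resident memory, with accumulated CPU time and age
    ps -eo pid,rss,cputime,etime,comm --sort=-rss | head -n 15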

We checked our Apache configuration, and that revealed we were starting 150 listeners (StartServers) with MinSpareServers set to 50 and MaxSpareServers set to 100. What we saw at startup was the 150 being started and 50 or so of them shutting down almost instantly. So, we adjusted the StartServers value down to 100. This should reduce process churn at startup time, and it makes sense to have the number of servers that you start match one of your spare server values. (The resulting settings are sketched after the next paragraph.)

We also saw that our Apache listeners were configured to handle 10,000 requests before shutting down (MaxRequestsPerChild). Since it looked like we could save our system by recycling the long-running Apache processes more frequently, we decided to lower this setting by an order of magnitude, dropping it from 10,000 down to 1,000. This change would, in theory, cause the Apache processes to shut down long before they had managed to consume hundreds of megabytes of memory.
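For reference, the relevant prefork settings now look roughly like this. The surrounding file layout and any other directives will differ from installation to installation, so treat this as a sketch of our values rather than a drop-in configuration:

    <IfModule mpm_prefork_module>
        StartServers          100
        MinSpareServers        50
        MaxSpareServers       100
        MaxRequestsPerChild  1000
    </IfModule>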

Monitoring of the system since making the changes has shown the above assumption to be correct. There are times when the Apache processes use more memory, but they seem to get shut down before the situation gets out of hand. This, of course, does not actually fix the problem of the high memory use, but it does mitigate the effects and is relatively pain-free. There is, admittedly, a slight increase in the load on the server as Apache processes are shut down and started more frequently, but that is still far less than the load spikes when the system runs out of memory and grinds to a screeching halt. Time will tell if this has really alleviated our situation, but so far "Signs point to yes."
