Two evenings ago I wrote a post about browsers still not having upload progress meters. The post was voted up on reddit, and the server got slammed. So slammed, in fact, that it was unusable for a few hours while I investigated the problem. I didn’t know the post was on reddit, but I knew I was getting some traffic. Unfortunately, the day before, I’d installed ‘Piwik’ – a tracking/analytics package. Given that it was the most recent change, I spent some time looking down that avenue first.
Actually, I spent most of my time trying to stop Apache so that I could have a usable machine. I’d then make a change, restart Apache, and within 5 seconds the machine would be unusable, and it’d take 2-3 minutes for the hastily typed ‘httpd stop’ command to do its thing. Then I’d start again. I spent some time trying some MySQL tuning options. I shut off my Tomcat processes and a separate Solr Jetty server process. The machine has only 1 GB of RAM and ‘only’ a 1.7 GHz processor, so luxuries like Java apps couldn’t be running while I sorted this out.
Finally I realized my APC cache wasn’t on. I’d put it on the server months and months ago, or thought I had. I’d moved servers, and can’t remember whether I’d reinstalled it or not. The php.ini file had it listed, but commented out. Turning it on and restarting didn’t work – APC had been compiled against a different PHP API version. I think what had happened was that I’d upgraded from PHP 5.1.3 to 5.2.5, the internal API was different enough that APC didn’t work anymore, and I must have commented it out in php.ini. It had been so long that I’d forgotten.
So, trying to recompile APC was an issue in itself, because “pecl install apc” didn’t work. It was already installed as far as pecl was concerned. So it was “pecl uninstall apc”, then “pecl install apc”, then a 6-minute wait for it to grind all the way up to the point where it complained about not being able to find ‘apxs’. I had an ‘apxs’ on the machine, and in my haste, I just copied that one into the path where the compiler could find it. Everything compiled and APC started up. Except it wasn’t caching anything (except itself). The apxs I’d copied was a ‘psa’ one – specific to Plesk’s administrative Apache version. So, that took an extra half hour to sort out (actually, I was taken away to something else, so my brother Mark stepped in and recompiled APC for me with a new apxs).
Turning on APC and getting Apache running again solved most of the problems, even though the load on the machine still hovered between 15 and 20 for the next hour or two. By this point much of the traffic spike had died down. I guess people coming to a non-responsive server eventually stop trying!
What I’d ended up doing at the top of the WordPress index.php file was a rand() call to allow 40% of the traffic in during the ‘bad’ times whilst this was being sorted out. That let some people in, and gave the other 60% of visitors a terse “server is under heavy load, try back later” message. Sorry if that was what you experienced yesterday, but it was the best I could do at that point.
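For the curious, the idea behind that rand() gate looks roughly like the sketch below. It is written in Python rather than the PHP that actually sat in index.php, and the function names, the 503 status code and the injectable `rng` parameter are all assumptions made for illustration, not what the real code did.

```python
import random

# Hypothetical wording, modelled on the message quoted above.
OVERLOAD_MESSAGE = "Server is under heavy load, try back later."


def should_admit(admit_fraction, rng=random.random):
    """Return True for roughly `admit_fraction` of calls (0.4 lets in about 40%)."""
    return rng() < admit_fraction


def handle_request(render_page, admit_fraction=0.4, rng=random.random):
    """Render the page for admitted visitors, otherwise return the terse message.

    `render_page` stands in for the expensive WordPress page build.
    The 503 status is an assumption; the original post does not say
    which status code the overload message was served with.
    """
    if should_admit(admit_fraction, rng):
        return 200, render_page()
    return 503, OVERLOAD_MESSAGE
```

The point of putting the check first is that rejected requests never reach the expensive page build, so the fraction can be raised as the load average settles.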
We (Mark and I) had spent a bit of time tracing through some xdebug output from WordPress during this time (I turned on xdebug cachegrind output for about 10 seconds and got quite a few dump files!). WordPress itself is just horribly inefficient under the hood. Somehow I always knew it, but it’s done for the sake of flexibility. At least that’s what I tell myself. Anyway, I’d spent some valuable time tracing through some of the function calls and I just couldn’t believe how wasteful it is in there. Obviously the APC code cache helps immensely, but that’s only part of the solution. However, given the ecosystem around WordPress, it’s hard to immediately jump ship to something else.
Anyway, that’s the quick analysis of what happened yesterday. There were over 8000 visits from reddit alone, and a few thousand from various other sources. However, as I said, more than 60% of visitors were getting ‘server overload’ messages, so many of those visitors wouldn’t have been counted.
By the way, the WordPress “super cache” plugin did absolutely nothing in the face of hundreds of concurrent reddit visitors. Not sure if I had set it up wrong, but it looked like there was only one way to set it up. See the comment below – I hadn’t finished configuring it properly – my fault.
Lesson learned, I’m more prepared now than I was a few days ago. I doubt I’ll see that much of a surge again any time soon, but I’m ready for it. Bring it on (in moderation, of course).
On a completely unrelated note, if you’re looking to hire someone, or looking for a new position, check out the new http://webdevjobs.com job board. Thanks.
I take that back – it looks like I hadn’t completed the ‘super cache’ configuration. That might have helped during the overload too, but I was a bit harried trying to look at all possibilities at the same time.
I remember getting a ‘server overload’ message. I had to hit refresh 3 times to get the article. Sorry about that.
I never quite understood how an entire web server could die from just 20,000 visits. It should be able to do stuff like that with 2 fingers in its nose while whistling a song and riding a one-wheeled bike.
As a programmer, thinking ahead about caching, being /.’ed or something is *the* thing that *always* should be somewhere in the back of your head, shouldn’t it?
Cache where you can, and use static files if you don’t need to update them from the db (for example, I would cache an entire blog post’s generated HTML including comments, then throw away the cache and regenerate it only if a new reply is posted or the original blog post is edited).
*WHAT* is going on inside WordPress that it brings an entire machine to its knees?
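The caching idea in the comment above (keep a post's generated HTML, and throw it away only on a new reply or an edit) could be sketched roughly as follows. This is a Python illustration, not WordPress or WP Super Cache code; `PostCache`, `render` and `invalidate` are hypothetical names, and a real setup would more likely write static files than hold HTML in memory.

```python
class PostCache:
    """Keep the rendered HTML for each post until that post changes."""

    def __init__(self, render):
        # `render` is a hypothetical function: post_id -> HTML string.
        # It stands in for the expensive database-backed page build.
        self._render = render
        self._html = {}

    def get(self, post_id):
        """Return cached HTML, rendering only on the first request."""
        if post_id not in self._html:
            self._html[post_id] = self._render(post_id)
        return self._html[post_id]

    def invalidate(self, post_id):
        """Call when a reply is added or the post is edited."""
        self._html.pop(post_id, None)
```

With this shape, a burst of visitors to one post costs a single render, and the next render happens only after `invalidate` is called for that post.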
Seems to me that the proper thing to do would be to adjust the Apache configuration to allow fewer concurrent connections.
Was there a reason you opted to do it in PHP (other than panic? (-: )?
S
I had turned down the Apache max children to about 50, and it was still dying. *1* WordPress hit would eat up 40% CPU and hang around for 3-10 seconds. I guess I could have turned it down to 5 children, but then the loading of graphics and files for other domains would have been impacted a lot.
I also tried changing the theme to a basic one with no images, so a blog view was just 1 html output and one css call. Didn’t help at all.
The problem was everyone hitting that one blog, so stopping the processing on that one blog was the route chosen.
Oh, I’d initially set the rand() factor to only allow 10% of the page visitors in, then upped to 20%, then settled on 40% after a bit of watching the load average levelling out.
WordPress can be horrible in high-traffic situations, but the cache plugin helps a great deal. I wondered why it wouldn’t work for you, but I see you already took that back.
Then there’s something else wrong. Perhaps it was hitting the swap?
Caching is the generally accepted method for dealing with high traffic. If you can get the WP Super Cache plugin working, then WordPress can handle significantly higher spikes in traffic.
The server was swapping initially, which didn’t help. I rebooted, and didn’t bring back up some services which ate up memory (Java apps mostly). That reduced the swapping, but some Apache processes were still long-running and CPU-intensive. I believe they were the ones where someone was posting a comment and the system was trying to hit Akismet. I eventually turned that off as well.