What caused that load spike?
February 13, 2007 – 4:48 pm
Every now and then we see a sudden increase in the number of Apache processes: the load average spikes, then goes back down to normal. In rare cases the same thing happens but the load average spikes WAY up, all queries appear to lock up, and the server must be rebooted. I am looking for ways to determine what causes this. I should note that it happens extremely rarely and has never shown up in a load test.
On the MySQL end, I use SHOW PROCESSLIST to try to figure out what’s causing the issue. However, sometimes there are just 150 queries sitting in there doing nothing (occasionally just SELECTs). I’m guessing it’s either a locking issue or too much disk access.
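With 150 rows in the process list it is hard to see a pattern by eye, so one option is to group the rows by state and query text. The following is only a rough sketch: it assumes the rows have already been fetched as dictionaries keyed by the standard SHOW FULL PROCESSLIST column names (Command, Time, State, Info), and the function name `summarize_processlist` and the sample rows are made up for illustration.

```python
from collections import Counter


def summarize_processlist(rows, top=5):
    """Group process list rows by (State, Info) and count them.

    `rows` is assumed to be a list of dicts using the SHOW FULL PROCESSLIST
    column names. Returns (state, query, count, longest_time) tuples,
    most frequent first.
    """
    counts = Counter()
    longest = {}
    for row in rows:
        # Idle connections are not interesting when hunting a pile-up.
        if row.get("Command") == "Sleep":
            continue
        key = (row.get("State") or "", (row.get("Info") or "").strip())
        counts[key] += 1
        longest[key] = max(longest.get(key, 0), int(row.get("Time") or 0))
    return [
        (state, query, n, longest[(state, query)])
        for (state, query), n in counts.most_common(top)
    ]


# Hypothetical sample rows, not taken from the server described above.
sample_rows = [
    {"Command": "Query", "Time": 40, "State": "Locked",
     "Info": "SELECT * FROM t WHERE id = 1"},
    {"Command": "Query", "Time": 12, "State": "Locked",
     "Info": "SELECT * FROM t WHERE id = 1"},
    {"Command": "Query", "Time": 55, "State": "Sending data",
     "Info": "SELECT * FROM big_table"},
    {"Command": "Sleep", "Time": 300, "State": "", "Info": None},
]

for state, query, n, seconds in summarize_processlist(sample_rows):
    print(f"{n:4d}  {seconds:5d}s  {state:15s}  {query}")
```

If most rows share one state, the few rows in a different state and with the longest running time are usually the first place to look.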
On the web server end, it’s a little more difficult. Ideally I’d like to know which URL was originally requested to create the hung Apache process. Does anyone know how to figure this out? Running on Fedora Core release 5 (Bordeaux).



2 Responses to “What caused that load spike?”
May be a bit late, and I’m guessing you’ve sorted this by now. But hey, someone might stumble upon it and find it useful. mod_log_forensic is very useful for tracking down web requests taking their sweet time. If they end up crashing, mod_whatkilledus might be of help as well.
As for the database server, you could run `iostat` to see if the disks are working overly much or `vmstat` to see if you’re swapping a lot (horrendous for a DB). Most likely the apache threads are a knock-on effect of the DB being locked up. Could also have been corrupt tables or indices. But most likely it’s a locking issue, difficult to say without the queries or knowing what storage engines etc are involved.
If you ever figured out what it was, I’d be interested in hearing what it was!
Good blog btw!
Erik
By Erik on Feb 23, 2008
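For anyone trying the mod_log_forensic suggestion: the module writes one line starting with `+` and a unique ID when a request arrives (followed by the request line and headers, separated by `|`), and a second line starting with `-` and the same ID when the request completes. Requests with no matching `-` line are the ones that hung or crashed. Apache ships a `check_forensic` script for this; the Python below is just a minimal sketch of the same idea, and the function name and sample log lines are hypothetical.

```python
def unfinished_requests(lines):
    """Return {forensic_id: request_line} for requests that never completed.

    Assumes the mod_log_forensic layout: '+ID|request line|headers...' when
    a request starts and '-ID' when it finishes.
    """
    pending = {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("+"):
            fields = line[1:].split("|")
            request_line = fields[1] if len(fields) > 1 else ""
            pending[fields[0]] = request_line
        elif line.startswith("-"):
            # Request finished normally, so forget about it.
            pending.pop(line[1:], None)
    return pending


# Hypothetical log excerpt for illustration only.
sample_log = [
    "+id1|GET /index.php HTTP/1.1|Host:example.com",
    "+id2|GET /slow.php HTTP/1.1|Host:example.com",
    "-id1",
]

for forensic_id, request in unfinished_requests(sample_log).items():
    print(forensic_id, request)
```

On a live server the input would be the open forensic log file rather than a list; its location depends on the `ForensicLog` directive in the Apache configuration.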
Hey Erik,
We ended up finding a few queries that were causing several simultaneous full table scans. Other queries that would normally have been very quick (1 row, based on primary key) would backlog behind them. Due to the number of MySQL connections (I think it was around 700) it was difficult to pin down the query in question, but we eventually used the mysqldumpslow tool on the slow query log to find the cause.
By jon on Feb 23, 2008
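To illustrate what mysqldumpslow does with the slow query log: by default it abstracts numbers to N and strings to 'S' so that queries differing only in their literal values are counted together. The sketch below imitates that grouping in a simplified way. It is not the real tool; it assumes log entries are introduced by a `# Query_time:` comment line, and the function names and sample entries are invented for the example.

```python
import re
from collections import defaultdict

QUERY_TIME = re.compile(r"^# Query_time:\s*([\d.]+)")


def normalize(query):
    """Replace literals so similar queries fall into the same group."""
    query = re.sub(r"'(?:[^'\\]|\\.)*'", "'S'", query)  # quoted strings
    query = re.sub(r"\b\d+\b", "N", query)              # bare numbers
    return re.sub(r"\s+", " ", query).strip()


def summarize_slow_log(lines):
    """Return (query, count, total_seconds) tuples, slowest total first."""
    totals = defaultdict(lambda: [0, 0.0])

    def record(seconds, statement):
        if seconds is not None and statement:
            entry = totals[normalize(" ".join(statement))]
            entry[0] += 1
            entry[1] += seconds

    query_time, statement = None, []
    for line in lines:
        line = line.strip()
        match = QUERY_TIME.match(line)
        if match:
            # A new entry starts: store the previous one first.
            record(query_time, statement)
            query_time, statement = float(match.group(1)), []
        elif not line or line.startswith("#"):
            continue
        elif line.lower().startswith(("set timestamp=", "use ")):
            continue  # bookkeeping statements, not the slow query itself
        elif query_time is not None:
            statement.append(line)
    record(query_time, statement)

    return sorted(
        ((q, n, secs) for q, (n, secs) in totals.items()),
        key=lambda item: -item[2],
    )


# Hypothetical slow log excerpt for illustration only.
sample_slow_log = [
    "# Time: 070213 16:48:01",
    "# User@Host: web[web] @ localhost []",
    "# Query_time: 12  Lock_time: 0  Rows_sent: 1  Rows_examined: 500000",
    "SELECT * FROM orders WHERE note = 'abc';",
    "# Query_time: 8  Lock_time: 0  Rows_sent: 1  Rows_examined: 500000",
    "SELECT * FROM orders WHERE note = 'xyz';",
    "# Query_time: 3  Lock_time: 0  Rows_sent: 1  Rows_examined: 10",
    "SELECT name FROM users WHERE id = 7;",
]

for query, count, seconds in summarize_slow_log(sample_slow_log):
    print(f"{count:3d}x  {seconds:6.1f}s  {query}")
```

Sorting by total time rather than by count tends to surface the full table scans first, since a handful of them can account for most of the time in the log.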