The problem was lock contention — we're still on MyISAM and occasionally a query would obtain a read lock on a table and hold it for 30-120 seconds. In that period a query would request a write lock, and after that some other queries would request read locks. But MySQL prioritizes write locks over read, so all of the queries requesting read locks had to wait for the writing process to get its lock and finish its write before they could proceed. Get a few of the problem queries in that queue and everything would repeat.
Luckily I remembered seeing a switch for that behavior — setting
low-priority-updatesprioritizes read locks over write. The writing process still has to wait for the pathological reader to release its lock, but in the meantime other processes get concurrent read locks and go about their business. Load average down to 4-8, which is pretty much all application, and context switching down to levels below what we saw before the update. Yay! I wish I'd made that change ages ago.
I seriously need to start planning an InnoDB migration soon (even with Oracle's purchase of Innobase).