Or: How to Debug 503’s, H12’s, and H13’s on Heroku
TLDR: I was wrong about log-runtime-entries! Turns out my unexplained dyno crashes were caused by a Gem that was misbehaving in its process forking action, which I’ve now fixed and submitted a pull request for.
Introduction and Apology
In my last post, I mentioned that I would keep an eye on my Heroku Rails app over the next 24 hours to see if removing log-runtime-metrics would indeed alleviate my problem with the unexplained dyno reboots that were causing 503, H12, and H13 errors.
Unfortunately, like a bad zombie movie, the problem just wouldn’t die. Overnight my dynos again suffered a reboot, and I again saw the dreaded ‘[FATAL] Failed to create timer thread (errno: 11)’ in the logs.
So, the problem appears not to be log-runtime-metrics after all. My apologies for raising the alarm too early — I should have waited the full 24 hours first.
Continuing the Search for the Real Killers
So what the heck is the actual problem? Something in the app is leaking threads, so how do I figure out where?
To verify my leaky thread hypothesis, I whipped up a short bash script which I then ran inside the (Warning: Big file) Ubuntu Lucid 64 Vagrant box where I develop my Rails app.
while true | |
do | |
# Command line inspired from http://askubuntu.com/a/88983 | |
ps -u vagrant -o thcount | tail -n +2 | awk '{ num_threads += $1 } END { print num_threads }' | |
sleep 1 | |
done |
I use the app for a while, and sure enough, I see the following scroll by:
43, 43 ... 45, 45 ... 44, 44 ... 46, 46 ... 45, 45 ... 47 ...
and I watch the count slowly increase, 2 steps forward and one step back, 2 steps forward and one step back. At this rate the number of threads being used would surely bust a limit given some time. It looks like I finally found the trail of blood. Something in my app was definitely creating threads and not cleaning up after itself.
To see if I could gain further insight, I took a closer look at the process list:
ZOMG! Zombie Processes!
Wow, so this is not only a thread leak, this is a process leak! Now I know that something in the app is forking processes and then failing to wait for the child process to complete. Over time this will result in thousands of Zombie Processes, causing a Zombie Process Apocalypse to wreak havoc on your dyno. (For a great discussion of what exactly a Zombie Process is, see this blog post by the Phusion Passenger team — look for “Not all frozen processes are made equal”)
I know that nothing in my Rails app is so clever as to try and fork processes (really bad idea), so the culprit has to be inside a Gem. After (a while) narrowing down some of the suspects, I finally found that the rack-streaming-proxy gem that I was using was forking processes, and indeed the parent processes were not waiting for the child processes.
I finally found the root cause of the Zombie Process Apocalypse.
Unfortunately this is a very old gem and is not maintained anymore, but I went ahead and fixed the bug, and submitted a pull request. The details of how the bug fix works is very interesting in its own right, and I may write a follow-up post about it.
But for now, in my experiments using the short bash script above, I’ve found that the fix works at making sure no more zombie processes are generated.
Overall this was definitely one of the hardest things I’ve had to debug; it’s one of those problems that only shows up at scale and after running for a long period of time. But at least for now, the Zombie Process Apocalypse has been averted!
Unabashed Self-Promotion
Got problems like these? I am a 7-year Ruby on Rails veteran, who previously founded (and subsequently exited) a startup as a CTO. I am available for short- and long-term consulting contracts involving CTO-level advice and Ruby on Rails project work. Please free free to get in touch.