Or: How to Debug 503s, H12s, and H13s on Heroku

TLDR: I was wrong about log-runtime-metrics! It turns out my unexplained dyno crashes were caused by a Gem that was mishandling process forking; I've now fixed the bug and submitted a pull request.

Introduction and Apology

In my last post, I mentioned that I would keep an eye on my Heroku Rails app over the next 24 hours to see if removing log-runtime-metrics would indeed alleviate my problem with the unexplained dyno reboots that were causing 503, H12, and H13 errors.

Unfortunately, like a bad zombie movie, the problem just wouldn’t die. Overnight my dynos again suffered a reboot, and I again saw the dreaded ‘[FATAL] Failed to create timer thread (errno: 11)’ in the logs.

So, the problem appears not to be log-runtime-metrics after all. My apologies for raising the alarm too early — I should have waited the full 24 hours first.

Continuing the Search for the Real Killers

So what the heck is the actual problem? Something in the app is leaking threads, so how do I figure out where?

To verify my leaky-thread hypothesis, I whipped up a short bash script, which I then ran inside the Ubuntu Lucid 64 Vagrant box (warning: it's a big download) where I develop my Rails app.

while true
do
# Command line inspired from http://askubuntu.com/a/88983
ps -u vagrant -o thcount | tail -n +2 | awk '{ num_threads += $1 } END { print num_threads }'
sleep 1
done

I use the app for a while, and sure enough, I see the following scroll by:

43, 43 ... 45, 45 ... 44, 44 ... 46, 46 ... 45, 45 ... 47 ...

and I watch the count slowly increase: two steps forward, one step back; two steps forward, one step back. At this rate the number of threads in use would surely bust some limit given enough time. It looks like I've finally found the trail of blood: something in my app is definitely creating threads and not cleaning up after itself.

To see if I could gain further insight, I took a closer look at the process list:

ZOMG! Zombie Processes!

Wow, so this is not only a thread leak, this is a process leak! Now I know that something in the app is forking processes and then failing to wait for the child process to complete. Over time this will result in thousands of Zombie Processes, causing a Zombie Process Apocalypse to wreak havoc on your dyno. (For a great discussion of what exactly a Zombie Process is, see this blog post by the Phusion Passenger team — look for “Not all frozen processes are made equal”)
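
To make the failure mode concrete, here is a minimal Ruby sketch of how a zombie comes about (purely illustrative; this is not code from my app or from any particular gem):

# Purely illustrative: how a zombie process comes about.
pid = fork do
  sleep 1   # stand-in for whatever work the child actually does
end

# Until the parent reaps the child, the child's exit status lingers in the
# process table and shows up in `ps` as <defunct> -- a zombie.
# Reaping it makes the zombie go away:
Process.wait(pid)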

I know that nothing in my Rails app is so clever as to try and fork processes (really bad idea), so the culprit has to be inside a Gem. After narrowing down the suspects (which took a while), I finally found that the rack-streaming-proxy gem I was using was forking processes, and indeed the parent processes were not waiting for the child processes to complete.

I finally found the root cause of the Zombie Process Apocalypse.

Unfortunately this is a very old gem that is no longer maintained, but I went ahead and fixed the bug and submitted a pull request. The details of how the fix works are interesting in their own right, and I may write a follow-up post about it.
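
For the curious, the general shape of this class of fix looks something like the sketch below. To be clear, this is only the pattern, not the actual patch; the key constraint in a streaming proxy is that the parent can't simply block on the child, because it still has a response to serve while the child runs.

# A sketch of the reaping pattern only -- not the actual rack-streaming-proxy patch.
pid = fork do
  sleep 1   # stand-in for the child doing the actual proxying work
end

# Option 1: hand the child off to a detached watcher thread, which reaps it
# whenever it exits, so the parent never blocks:
Process.detach(pid)

# Option 2: reap it explicitly later without blocking, for example once the
# streamed response has been fully served:
#   Process.waitpid(pid, Process::WNOHANG)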

But for now, in my experiments using the short bash script above, the fix does its job: no more zombie processes are generated.

Overall this was definitely one of the hardest things I've ever had to debug; it's the kind of problem that only shows up at scale, after the app has been running for a long time. But at least for now, the Zombie Process Apocalypse has been averted!

Unabashed Self-Promotion

Got problems like these? I am a 7-year Ruby on Rails veteran who previously founded (and subsequently exited) a startup as its CTO. I am available for short- and long-term consulting contracts involving CTO-level advice and Ruby on Rails project work. Please feel free to get in touch.

Update 2: I was wrong about log-runtime-metrics! Turns out the culprit was a concurrency bug inside a Gem. Read how I fixed it here.
Update 1: Hacker News discussion here.

TLDR: As of April 25, 2013, do not run log-runtime-metrics for extended periods of time, as it can crash your Heroku dynos (and cause them to reboot) multiple times daily. I was wrong! Please see Beware of the Zombie (Process) Apocalypse above.

Introduction

In the wake of the Rap Genius brouhaha, Heroku has been on a tear releasing many new features intended to help you tune and debug your Rails app's performance. One of them is the Heroku Labs feature log-runtime-metrics, which, combined with the Log2Viz feature, gives you great insight into your app's memory usage. You get a detailed breakdown of memory usage per dyno in your logs, and armed with this information you can scale your Unicorn worker count accordingly.

Emerging Problems

With this kind of amazing visibility into your dynos, it is tempting to leave this Labs feature enabled. After all, what can go wrong? Well, it turns out, lots. After having this feature enabled for a few days, I started noticing some unexplained dyno reboots inside NewRelic.

These restarts were causing 503 timeouts, with Heroku H13 errors manifesting in the logs. (NB: In my case, I am running Unicorn with multiple processes, so each of these instance restarts is actually a single Unicorn worker restarting.)

Investigating the Mystery

Digging deeper into the mystery, I went into the Logentries add-on and examined the Heroku logs from around that time. I was not receiving any exceptions in Airbrake, so I knew it had to be a middleware problem or some kind of system-level problem.

So let’s take a look at the logs. When things are running fine, you can see that log-runtime-metrics outputs your dyno’s stats every 20 seconds or so.

(As an aside, I am using a 2X dyno, so my memory usage of ~811 MB is not a problem — in fact, it looks like I have enough memory to add another Unicorn worker. Again, log-runtime-metrics is great.)
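
For reference, if you do decide to add a worker, the count is typically controlled by a single line in config/unicorn.rb. The snippet below shows the common Heroku-style convention; the WEB_CONCURRENCY variable and the other settings are just examples, not necessarily what your app uses:

# config/unicorn.rb (a typical Heroku-style setup -- adjust to your own app)
worker_processes Integer(ENV["WEB_CONCURRENCY"] || 3)
timeout 30
preload_app true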

Around 04:11:21 things start to go wrong:

Pointing the Finger at the Wrong Suspect

You can see that one of the Unicorn workers started to die repeatedly with fairly ominous error messages. At this point, my first thought was that the NewRelic gem was causing the reboots, because of this stack trace:

But here’s the spoiler: it turns out NewRelic was just the fall guy! Here’s why. If you take a look at the log-runtime-metrics gist above, you can see that the metrics are emitted every 20 seconds. Well, hmm, what needs a timer thread like that? After a bit of thought, it hit me: log-runtime-metrics, of course!

Catching the Real Crook

As you can see from the logs, the time that the thread creation error occurs is almost exactly 20 seconds from the last metric report — which tells me that log-runtime-metrics is trying to start a thread to wake itself up in 20 seconds for the next report. However, it encounters difficulty doing so:

This starts a cascading chain of failures which causes the entire dyno to reboot:

log-runtime-metrics obviously has a thread leak, as no further threads can be created. I surmise that log-runtime-metrics is not joining its timer threads when they are done and that it is running into the PTHREAD_THREADS_MAX limit in the underlying Ubuntu OS that runs the dyno.
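
As an aside, if you want to spot this kind of leak from inside your own Rails process, a quick-and-dirty reporter along these lines works (just a sketch; keep in mind that Thread.list only counts Ruby-level threads, so threads created by native code won't show up):

# A rough in-process thread counter -- drop it in a Rails initializer.
# It uses exactly one long-lived thread itself, so it won't add to the leak.
Thread.new do
  loop do
    Rails.logger.info "[thread-count] #{Thread.list.size} Ruby threads alive"
    sleep 20
  end
end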

Note that during this two-minute thrashing interval, no further runtime metrics are reported, so log-runtime-metrics really is dead.

After the reboot, the metrics on web.1 start reporting every 20 seconds again, even before Unicorn is booted:

All this definitely points toward log-runtime-metrics as the culprit in this scenario. Accordingly, I have turned log-runtime-metrics off in my production system, and after a few hours of run time there have been no more unexplained dyno reboots. I will keep an eye on it over the next 24 hours to verify that log-runtime-metrics did indeed cause the problem.

So the next time you see Heroku's beta feature notice, pay attention and take it at face value: it really is an experimental feature!

Unabashed Self-Promotion

Got problems like these? I am a 7-year Ruby on Rails veteran who previously founded (and subsequently exited) a startup as its CTO. I am available for short- and long-term consulting contracts involving CTO-level advice and Ruby on Rails project work. Please feel free to get in touch.

Well hello out there.

It’s been a while since I’ve blogged, and a lot has changed since I last posted nearly two and a half years ago. For one, I no longer live in Montreal, and I am now a freelance Ruby-on-Rails consultant (or as Austin calls it, a “CTO-on-Demand”).

Accordingly, this blog will probably take a slightly more technical bent than it used to.

If any of my old readers are still around, I’d love it if you said Hi in the comments!

Talent Tech announces acquisition of Standout Jobs

I’m coming out of blogging hiatus to provide an update on the Standout Jobs acquisition, which I wrote about a few months ago. At that time we could not yet name the acquirer. Today, the veil has been lifted — Talent Technology has announced its acquisition of Standout Jobs. (As an aside, ERE broke the news first, I think!)

As for me, I’ve been working with Talent Tech in Vancouver for quite a few months now and will soon be completing my mandate as the former CTO of Standout Jobs. It’s been a great experience working with the Talent Tech team and I truly wish them the best. They will take care of Standout Jobs well and I feel confident handing over the reins to them.

In the new year I should have some news about my next set of adventures, but for now I will again go into a blogging hiatus. So if you’d like to keep in touch please follow me on Twitter (@fredngo)!

It’s been a long journey since Ben, Austin, and I started Standout Jobs — unbelievably it’s already been three and a half years. But today, we’re happy to announce that Standout Jobs has been acquired.

Ben has already written up a nice post about it, so please go ahead and read about the details there.

It is unfortunate that we cannot name the acquirer at this time. However, I am very excited about their plans for the product, and I have already been working with the acquirers for some time transitioning the product. I will be sure to post an update when the acquirer is ready to go public.

I’d like to recognize everyone who has been involved in the Standout Jobs project from start to finish — we had the most amazing team I’ve ever worked with and I know that everyone will go on to continue to do great things. So, shout outs go to Ben Yoskovitz, Austin Hill, Marc-Andre Cournoyer, Daniel Haran, Gary Haran, Francis Wu, Thanh Vinh Tang, Martin Aumont, Alain Wong, Natasha Duchene, Yezin al-Qaysi, Steve McNair, Alexandra Rimmington, as well as our lead investors iNovia Capital and Montreal Start Up.

I started this blog with the idea of documenting this journey. Unfortunately, at least for me, it seems we spent so much time working that little energy was left to document it. The other strange thing is that we were very often prevented from discussing details of goings-on, which I suppose is to be expected in the business world. Still, we will all carry the memories of this adventure with us forever.

And so, fittingly, this blog will end here for now, until I go on to my next set of adventures. Until then, please follow me on Twitter (@fredngo) if you’d like to keep in touch.

I had a really great time speaking at the Impact Expo(se) event last Saturday. Thanks to Iain and Lisa from Communitech for inviting me, and thanks to the entire Communitech and Impact teams for organizing the event. Believe me, I know how hard it is to organize these things!

A number of people asked me to post the slides from my talk at the IMPACT Expo(se) event this past Saturday. I don't know how much sense they will make without the accompanying soundtrack, but here they are nevertheless:

10 Tips for Startup Entrepreneurs

And yes… there are only 9 Tips; that's because I ran out of preparation time. We are in crunch mode at Standout Jobs right now, after all… ;-)

Now, back to work!

Entrepreneur Week

I’ll be speaking in Waterloo on Saturday October 4th, at the IMPACT Expo(se) Event as part of Entrepreneur Week, organized by the fine folks at Communitech. I’ll be speaking at about 12:30pm.

I don’t know if any of my readers are out that way, but if you are, come by and say hi, the event is free!

I’ll be talking about various startup-y topics and my experiences in the startup world.

The super famous Ben Casnocha will also be speaking at the same event, so it’s worth coming just to see him, if not me! ;-) (Looking forward to meeting you, Ben!)

Bankruptcy

My friends, please let it be known that today, I am declaring Email Bankruptcy.

I first heard about this concept when Lawrence Lessig declared email bankruptcy in 2004. Fast forward to today, and I find myself in a similar predicament… As of this writing, I have more than 900 unread emails, and I do not have enough hours in a day to process all my incoming emails and my backlog… and so, I’ve junked my entire backlog of emails.

I am very sorry, but if you've sent me an email and never gotten a response, chances are I never even had a chance to read it, and a reply is not forthcoming. If it is very important, please resend it and I should be in a better position to respond.

A Fresh Start

As Standout Jobs heads into its next stage of development, it becomes very important for me to have a clear plate and a clear mind. This also applies to Cat’s Corner, which is moving to a new location as of Friday August 1st.

Taking this drastic step will give me the fresh start that I need.

It remains to be seen whether I will be successful. I could easily spend all my working hours reading and writing email. But if I did that I wouldn't be able to get any real work done, which, for a techie, requires long hours of concentration without distraction. So the logical conclusion is that any overflow email will continue to accumulate until I have to declare email bankruptcy again.

Email by its nature has its limits. It’s a gift that keeps on giving. If I could get away with it, I wouldn’t use email at all!

How to Contact Me

Again, my apologies if you've emailed me and never gotten a response. But now, knowing my aversion to email, let me give you the secret to contacting me and actually getting a response… If you'd like to contact me, please try the following channels first:

  • Phone
  • Instant Messenger: I’m on Skype, MSN, gTalk, Yahoo! Messenger, AIM, and ICQ. (Protip: Use Adium to not go crazy with multiple chat clients)

And if you still absolutely must send email, please send it to the appropriate application-specific address below for a higher rate of response:

  • Standout Jobs Stuff: fred at standoutjobs dot com
  • Cat’s Corner Stuff: fred at catscorner dot ca
  • Social Life Stuff: Facebook messaging (Be warned that I will ignore anything business-y on Facebook.)
  • Anything else: fredngo at gmail dot com

Today’s TechCrunch article, Think Before You Voicemail, struck a nerve with me. I’ve very often felt the same way for the exact same reasons Michael outlined:

It takes much longer to listen to a message than read it. And voicemail is usually outside of our typical workflow, making it hard to forward or reply to easily.

The big problem with voicemail is that it forces a context switch out of your normal workflow: dialing into the system, figuring out how to fast-forward or delete messages, transcribing messages, and so on. This might sound minor, but it really takes its toll when you've already got a ton of things on your plate to deal with on a daily basis.

Other minor problems with voicemail include long distance charges to dial into my voicemail when I’m traveling, and the delay caused by my tendency to not notice that I have a voicemail until the next time I make a telephone call.

So today I've taken a big step toward escaping voicemail. I've set up my cell phone so that voicemails are automatically transcribed into text and sent to my email (and optionally SMS'd to me). How did I do this? It was not straightforward, so I am documenting the steps here for anyone who'd like to do the same.

Without further ado, here’s how I set up PhoneTag with my iPhone running on the Fido network.

Step 1

The first step is to subscribe to a voicemail transcribing service. The service I ended up subscribing to is PhoneTag. There are several other services, including YouMail and SpinVox, but nearly all suffer from one big flaw — they do not work with Canadian numbers. (Don’t even get me started on how we always get shafted on technology in Canada…)

Fortunately PhoneTag does work in Canada, so they get my dollars. (Not much actually, only 35 cents per transcribed message; other plans available too.)

Step 2

Once you've set up your PhoneTag account, you have to instruct your carrier to use PhoneTag's voicemail box when you don't answer your phone. This is accomplished via Conditional Call Forwarding. In layman's terms, you have to ask your carrier to forward a caller to voicemail when you 1) don't answer, 2) are unreachable, or 3) are already on the phone.

When you get a phone from Fido, it's already pre-programmed to forward calls to the Fido voicemail. In this case, we want to switch to PhoneTag's voicemail. Here's where things get a bit complicated: I am using an unlocked iPhone on the Fido network, and unfortunately I could not find a way to set Conditional Call Forwarding inside the iPhone's interface. However, we can get around this by programming the appropriate GSM Commands directly into the phone!

The GSM Commands I used were the following:

  1. Call Forwarding if No Answer: *61*[dest]#
  2. Call Forwarding if Unreachable: *62*[dest]#
  3. Call Forwarding if Busy: *67*[dest]#

where [dest] is the destination of the call forwarding, i.e. the phone number of PhoneTag’s mailbox. So as an example, if your mailbox’s number is 514-123-4567, then the appropriate sequence to program Call Forwarding if No Answer would be *61*5141234567#.

So I opened up the iPhone's Keypad, typed in and dialed each of the codes above, and voilà: my callers are now directed to PhoneTag's voicemail. (You'll know you did it right if you see some status messages.)

Now voicemails are automatically transcribed into text and sent to my email, where I will see them the next time I return to my workstation.

Step 3

For completeness’ sake, rewire the Voicemail button on the iPhone to point to your new mailbox by issuing the following command:

*5005*86*yourvoicemailphonenumber#

In PhoneTag’s case, yourvoicemailphonenumber is 18007840457.

But of course we’ll never use that button again, right?

Step 4

Finally, call Fido Customer Service at 611 and cancel your voicemail option, saving yourself a few bucks per month.

There you have it, escape from voicemail.

Now if only I could escape from email too, then life would be heaven!

DEMO Button

I’m here in Sunny California, at the beautiful Marriott Desert Springs Resort where the DEMO conference is being held this year. Yup, Standout Jobs was selected to present at DEMO!

It's been a long journey for Standout Jobs, and this week marks a major milestone: Standout Jobs has launched our first product, Reception, at DEMO 08. We waited until exactly 7am EST on Monday, January 28th, which is when the embargo is lifted according to DEMO rules, to launch the product. The reception (pun intended) has been amazing: we were covered by TechCrunch, GigaOM, and Mashable among the major news sites, as well as by local sites such as A frog in the valley, I never nu, Web1979, MontrealTechWatch, TechCFO, Mikel.org, Guillaume Thoreau, and Quebec Valley.

(If you're interested in all the details, you can of course consult the Standout Jobs Blog.)

Most of my readers probably also read the likes of my cofounders Ben Yoskovitz and Austin Hill, so all this news is no surprise to you… But I do want to remind you that we will be presenting at 12:44PM EST, LIVE on the web at demo.com, and that is where you'll finally be able to see exactly why I've been so reclusive over the last year! ;-) But to be completely serious for a second, we'd love to have your virtual moral support.

I'd also like to give big props to the Standout Jobs A-Team that made this happen. In alphabetical order:

Thank you for your hard work, everyone. We'll celebrate when I get back!