Archive

Programming

Or: How to Debug 503’s, H12’s, and H13’s on Heroku

TLDR: I was wrong about log-runtime-entries! Turns out my unexplained dyno crashes were caused by a Gem that was misbehaving in its process forking action, which I’ve now fixed and submitted a pull request for.

Introduction and Apology

In my last post, I mentioned that I would keep an eye on my Heroku Rails app over the next 24 hours to see if removing log-runtime-metrics would indeed alleviate my problem with the unexplained dyno reboots that were causing 503, H12, and H13 errors.

Unfortunately, like a bad zombie movie, the problem just wouldn’t die. Overnight my dynos again suffered a reboot, and I again saw the dreaded ‘[FATAL] Failed to create timer thread (errno: 11)’ in the logs.

So, the problem appears not to be log-runtime-metrics after all. My apologies for raising the alarm too early — I should have waited the full 24 hours first.

Continuing the Search for the Real Killers

So what the heck is the actual problem? Something in the app is leaking threads, so how do I figure out where?

To verify my leaky thread hypothesis, I whipped up a short bash script which I then ran inside the (Warning: Big file) Ubuntu Lucid 64 Vagrant box where I develop my Rails app.

I use the app for a while, and sure enough, I see the following scroll by:

43, 43 ... 45, 45 ... 44, 44 ... 46, 46 ... 45, 45 ... 47 ...

and I watch the count slowly increase, 2 steps forward and one step back, 2 steps forward and one step back. At this rate the number of threads being used would surely bust a limit given some time. It looks like I finally found the trail of blood. Something in my app was definitely creating threads and not cleaning up after itself.

To see if I could gain further insight, I took a closer look at the process list:

ZOMG! Zombie Processes!

Wow, so this is not only a thread leak, this is a process leak! Now I know that something in the app is forking processes and then failing to wait for the child process to complete. Over time this will result in thousands of Zombie Processes, causing a Zombie Process Apocalypse to wreak havoc on your dyno. (For a great discussion of what exactly a Zombie Process is, see this blog post by the Phusion Passenger team — look for “Not all frozen processes are made equal”)

zombie cats-01

I know that nothing in my Rails app is so clever as to try and fork processes (really bad idea), so the culprit has to be inside a Gem. After (a while) narrowing down some of the suspects, I finally found that the rack-streaming-proxy gem that I was using was forking processes, and indeed the parent processes were not waiting for the child processes.

I finally found the root cause of the Zombie Process Apocalypse.

Unfortunately this is a very old gem and is not maintained anymore, but I went ahead and fixed the bug, and submitted a pull request. The details of how the bug fix works is very interesting in its own right, and I may write a follow-up post about it.

But for now, in my experiments using the short bash script above, I’ve found that the fix works at making sure no more zombie processes are generated.

Overall this was definitely one of the hardest things I’ve had to debug; it’s one of those problems that only shows up at scale and after running for a long period of time. But at least for now, the Zombie Process Apocalypse has been averted!

Unabashed Self-Promotion

Got problems like these? I am a 7-year Ruby on Rails veteran, who previously founded (and subsequently exited) a startup as a CTO. I am available for short- and long-term consulting contracts involving CTO-level advice and Ruby on Rails project work. Please free free to get in touch.

Update 2: I was wrong about log-runtime-metrics! Turns out the culprit was a concurrency bug inside a Gem. Read how I fixed it here.
Update 1: Hacker News discussion here.

TLDR:  As of April 25, 2013, do not run log-runtime-metrics for extended periods of time, as it can crash your Heroku dynos (and cause them to reboot) multiple times daily. I was wrong! Please see Beware of the Zombie (Process) Apocalypse

Introduction

In the wake of the Rap Genius brouhaha, Heroku has been on a tear releasing many new features intended to help you tune and debug your Rails app’s performance. One of them is the Heroku Labs feature log-runtime-metrics, which combined with the Log2Viz feature gives you great insight into your app’s memory usage. You get a detailed breakdown of memory usage per dyno in your logs, and armed with this information you can scale your Unicorn worker count accordingly.

Emerging Problems

With this kind of amazing visibility into your dynos, it is tempting to leave this Labs feature enabled. After all, what can go wrong? Well, it turns out, lots. After having this feature enabled for a few days, I started noticing some unexplained dyno reboots inside NewRelic.

unexplained-heroku-dyno-reboots

These restarts were causing 503 timeouts, with Heroku H13 errors manifesting in the logs. (NB: In my case, I am running Unicorn with multiple processes, so each of these instance restarts is actually a single Unicorn worker restarting.)

Investigating the Mystery

Digging deeper into the mystery, I went into the Logentries add-on and examined the Heroku logs around that time. Specifically, I was not receiving any exceptions inside AirBrake so I knew it had to be a middleware problem or a system-level kind of problem.

So let’s take a look at the logs. When things are running fine, you can see that log-runtime-metrics outputs your dyno’s stats every 20 seconds or so.

(As an aside, I am using a 2X dyno, so my memory usage of ~811 MB is not a problem — in fact, it looks like I have enough memory to add another Unicorn worker. Again, log-runtime-metrics is great.)

Around 04:11:21 things start to go wrong:

Pointing the Finger at the Wrong Suspect

You can see that one of the Unicorn workers started to die repeatedly with fairly ominous error messages. At this point, my first thought was that the NewRelic gem was causing the reboots, because of this stack trace:

But here’s the spoiler: it turns out NewRelic was just the fall guy! Here’s why. If you take a look at the log-runtime-metrics gist above, you can see that the metrics are emitted every 20 seconds. Well, hmm, what needs a timer thread like that? After a bit of thought, it hit me: log-runtime-metrics, of course!

Catching the Real Crook

As you can see from the logs, the time that the thread creation error occurs is almost exactly 20 seconds from the last metric report — which tells me that log-runtime-metrics is trying to start a thread to wake itself up in 20 seconds for the next report. However, it encounters difficulty doing so:

This starts a cascading chain of failures which causes the entire dyno to reboot:

log-runtime-metrics obviously has a thread leak, as no further threads can be created. I surmise that log-runtime-metrics is not joining its timer threads when they are done and that it is running into the PTHREAD_THREADS_MAX limit in the underlying Ubuntu OS that runs the dyno.

Note that during this two-minute thrashing interval, no further runtime metrics are reported, so log-runtime-metrics really is dead.

After the reboot, the metrics on web.1 start reporting every 20 seconds again, even before Unicorn is booted:

All this definitely points toward log-runtime-metrics as the culprit in this scenario. Accordingly, I have turned log-runtime-metrics off in my production system, and there have been no more unexplained dyno reboots, after a few hours of run time. I will keep an eye on it over the next 24 hours to verify that log-runtime-metrics did indeed cause the problem.

So the next time you see this notice:

heroku-beta-feature

…pay attention and take it at face value, it really is an experimental feature!

Unabashed Self-Promotion

Got problems like these? I am a 7-year Ruby on Rails veteran, who previously founded (and subsequently exited) a startup as a CTO. I am available for short- and long-term consulting contracts involving CTO-level advice and Ruby on Rails project work. Please free free to get in touch.

I was planning to write a post called “What Makes a Good Software Entrepreneur?” when I came across this article about the Top Signals of Success for Software Entrepreneurs. Great! Now I don’t have to write the post. ;-)

So the Top Signals of Success for Software Entrepreneurs are:

  • You like to experiment
  • You like to read and learn
  • You like to tinker and build
  • You are opportunistic
  • You are an artist
  • You Live On Email
  • You are considerate, respectful and nice
  • You have a proclivity for action
  • You attract others like yourself
  • You are a realist
  • You are exceptionally intelligent
  • You are fundamentally likable
  • You exhibit “balanced frugality”
  • You work hard
  • You just love the game

(Be sure to read the article for explanations.)

While I don’t agree with everything on the list (pretty much only because I hate email ;-) ), on the whole I thought the article got it right. It’s hard, however, to self-evaluate many of these characteristics.

Of course I can’t really speak with any real authority on the subject yet, being only an entrepreneur-in-training… So I’ll be sure to revisit this subject again when I hit one out of the ballpark. ;-)

Our friend Mat Balez is looking for a partner for his project: codename::ideack.

From his posting:

I’m looking for someone to help me build the thing, likely in a partnership capacity. Someone with significant Ruby on Rails (or Python/Django if you are convincing) and Javascript expertise to lead the site’s development, and with design chops sufficient to create something that might be characterized as ‘easy on the eyes’.

If you are this person, you will be:

– passionate about web technologies
– capable of building software that will scale
– eager to work hard to help bootstrap a cool new web project
– an all around good person
– preferably based in Montreal

If interested, send me an email and we’ll get together to chat further.

Best of luck in your partner search Mat! I’ll be fully supporting you in this project as we start to create a startup culture in this town of ours. :)

PowerBook

A quote from Paul Graham’s latest essay “Microsoft is Dead“:

I’m now surprised when I come across a computer running Windows. Nearly all the people we fund at Y Combinator use Apple laptops. It was the same in the audience at startup school. All the computer people use Macs or Linux now. Windows is for grandmas, like Macs used to be in the 90s.

This phenomenon is easily observed at the BarCamps and DemoCamps here at home as well…

I too have to confess that I’ve also switched to Mac. I made the switch when the new Core 2 Duo MacBook Pros were released last November, but interestingly enough, my reasons for switching were more pragmatic rather than idealistic.

I was at a point where I needed to replace both my old desktop (4 years old) and my old laptop (2.5 years old). I added the numbers up, and figured that the new MacBook Pro would serve just fine as a new laptop and desktop replacement, for basically the same amount of money!

Another big reason for the switch is that I had already been using Linux on my personal machines for more than a year (and using it professionally for way longer than that), and the fact that OS X is built on top of a Unix kernel is just phenomenal. It’s like having the command-line power of Unix without the crappy user interface.

One big enabling factor for the switch is the mass migration to online tools. I used to be an Outlook Slave for many years, but ever since I adopted GMail, wikis, and other online tools it was much easier to make the switch to Linux and then Mac.

So far I’m very happy. My MacBook Pro does have some glitches; it crashes hard occasionally, but mostly when entering or exiting sleep mode. I’ll have to get that looked at some time… but it’s probably a tiny hardware glitch and nothing that the Apple stores can fix anyway.

One great thing about OS X is that, in contrast with Windows, it rarely ever asks you to “Confirm” stuff, and 3rd party software rarely ever forcibly notifies you of silly things (“An update for your HP printer driver is available!”). The OS just gets out of the way and lets you work.

I’ve had to learn a whole new set of tools on the Mac, but it’s been really fun and worth it! I’m now using the fabulous TextMate for day-to-day work and let me tell ya… I can’t and won’t ever go back to plain-old Emacs and plain-old X-based UI again.

One other thing; and I’m not sure if it’s just a placebo effect, but I feel I’ve been way more creative since I started using a Mac. Fact or Fiction? Who knows. :)

Ha ha, guys… Don’t take anything posted on April Fool’s Day too seriously. ;-)

Time now for a real post: a report from BarCampOttawa3.

One Laptop Per Child
One Laptop Per Child
Originally uploaded by fredngo.

I am writing this blog post at BarCampOttawa3. The Ottawa crew have done a great job organizing; and I heard that up to 200 people attended this edition! Great job guys.

Compared to Montreal, I noticed that there is a much bigger cross-section of the technology world represented here. In addition to the usual Web 2.0 enthusiasts, there are also traditional application developers (C++, etc.) as well as microchip designers, RF design engineers, embedded systems designers, and even one guy who has a startup working on internet-connected door locks. All of this makes for a much bigger cross section of possible discussion topics and a much more concrete reason to run multi-tracks. (They are running 4 concurrent tracks at this BarCamp.) Until we start getting this kind of crazy cross-sectional representation, I think our current tack with a “1.5” track BarCampMontreal2 will work very well (1 big presentation room, 1 smaller discussion room).

Another thing that was immediately obvious was a real mix of people and cultures: Indian accents, German accents, Japanese accents… Ah, wonderful multiculturalism… In this current climate of backlash against minorities in la belle province, and as a member of a minority myself, I want to emphasize how much fun it was and very refreshing to be around this kind of atmosphere. We have much to gain from accepting other cultures into our reality.

OK, enough political commentary; let’s get on to my thoughts on the presentations. I had time to write up two of them.

10 (or so) Legal Considerations when Starting a Tech Company

One great presentation I attended was by Mr. Mike Dunleavy (md@lwlaw.com, 613.599.9600 x268) from the law firm of Labarge Weinstein, called 10 (or so) Legal Considerations when Starting a Tech Company.

A couple of things I learned from the presentation:

The lifetime capital gains exemption has recently been raised to $750,000.

You can establish a family trust to save potentially millions in taxes; the idea is to split the shares of your company with family and friends, in order to use their exemption limits. So when your company is sold for $7.5M and you’ve used a family trust to split the shares among 10 friends and family, the entire amount is tax-free. Seems like a huge tax loophole to me, but at the moment it’s a perfectly legal maneuver.

He talked about many other important things including IP Agreements, Shareholder Agreements, and Tax considerations. It was a fascinating presentation and doubly more so because it is a Canadian-based view (previously I had only attended such presentations in the U.S.).

I’d personally love to see this type of legal presentation at BarCampMontreal2, so if you’re a Montreal lawyer specializing in helping Tech Startups, please contact me. It would be wonderful to have you with us.

Ruby on Rails and What It Entails

Tobias Lütke
Tobias Lütke
Originally uploaded by fredngo.

Another presentation I attended, Ruby on Rails and What It Entails, was given by Tobias Lütke, a CoFounder of jadedpixel.com. They are the guys behind the amazing Shopify.

I found it interesting that the first and foremost point he made was that Ruby on Rails developers believe in “Beautiful Code”. The reasoning is that Beautiful Code leads to Happiness. Happiness leads to productivity. Productivity leads to met deadlines. Finally, met deadlines leads to successful products.

I really agree with this; throughout my various incarnations as an engineer and developer I’ve spent an inordinate amount of time hacking on really, really, really, ugly code. As soon as a system becomes a steaming pile of junk, it’s just no longer “fun” to work on, and having fun is really key to productivity.

Tobias claims that happy teams of 4 using RoR outperform normal teams of 40 using outdated technology.

Another interesting thing he said is that it is extremely easy to replace pieces of Ruby on Rails with natively compiled C, so as you observe the loads that are placed on your application you can optimize the most time-critical parts if necessary.

I do really have to investigate Ruby on Rails a lot more.

Programming Books

I’ve been getting some email from interested readers wondering what I’m up to lately. It turns out that being a tech entrepreneur isn’t all that exciting day-to-day, especially when you’re in stealth mode… because there isn’t much to talk about!

Especially in my case, since I’m in full-on learning mode. I hereby present exhibit A, my stack of programming-related books. (I’ve run out of space on my bookshelf so they’re sitting on the ground.) The top 8 books have been acquired within the last month, and I still haven’t even gone through 10% of any one of them. On the other hand, in the last month I’ve probably learned more than I have in all of 2006!

This leads me to a tangent — books are still the best way of learning stuff, even in this age. I first thought I could do away with books and just rely the internet for what I need to know, but it turns out that when I find a webpage or a forum posting about how to do something, I can never be sure what version of the software the answer applies to. Is the answer about Apache 1.x or Apache 2.x? Is it about PHP 4 or PHP 5? I can make a guess based on the date, but that obviously sucks. However, when I buy a good book, I know that the quality of the writing will be high, and I can be sure that it covers the latest revision of the software.

Anyway, all this to say that I’m in hermit-hacker mode these days, with the accompanying hermit-hacker schedule in full effect. This means rarely going to bed before the single digits, and rarely waking up before the double digits. But don’t get me wrong… it’s freakin’ fun.

Besides, it’s minus double digits outside… What else could I possibly be doing? :)