Erlang and unreliable resources

Here’s something you should be aware of if you’re using Erlang in a large system.  It’s a pity the core team hasn’t included something like it in the Erlang distribution itself, or even documented it, because I think you’ll eventually need it if you build a big enough system.

In a large system, you’ll have components that should be up and running but at some point might not be.  Take a web site with a database.  If you write your Erlang code to just “let it crash” when the web server can’t connect to the database, just like you read in the books, then your whole site will fall over, web server and all.  At this point, you can’t even show your visitors a message explaining that “we’re experiencing problems, please try again later”, because the problem has propagated throughout the system, bringing it all down.

Another scenario might be a machine with a user interface and some fragile hardware.  The machine can’t do its job without the hardware, but if the hardware is broken, the software should not crash completely!  The user interface needs to stay active, letting users know that the machine is broken, and perhaps offering some diagnostics or some steps to try to correct the problem.

What these have in common is that some components of the system should always try to be available even if they are not 100% functional, while other bits and pieces may be necessary for the system to work correctly, but should not make it entirely unavailable when something goes wrong.

The term for this is “circuit breaker”, because it breaks the chain that is part of a normal OTP system, where enough failures of a worker lead to its supervisor crashing, then that supervisor’s own supervisor crashing after enough failures, and so on up the tree.  One high quality implementation is located here: https://github.com/jlouis/fuse – and it politely points to some other, similar implementations.  I’m a bit frustrated that I found this only recently, because it’s a nice bit of code that strikes me as quite likely to be useful somewhere within any sufficiently large Erlang system.
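
To make the idea concrete, here is a minimal sketch of a circuit breaker written as a pure Erlang module.  It is not how fuse or hardcore work internally, and the module name breaker_sketch, the functions new/2 and call/3, and the return values are all made up for illustration.  It assumes the caller passes in the current time in milliseconds (for example from erlang:monotonic_time(millisecond)) and keeps the breaker state itself, which a real system would more likely hold in a process or an ETS table.

```erlang
-module(breaker_sketch).
-export([new/2, call/3]).

%% Hypothetical illustration only; this is NOT the API of fuse or hardcore.
%% A breaker is a map holding: how many consecutive failures are tolerated,
%% how long to refuse calls once blown (milliseconds), the current count of
%% consecutive failures, and the time at which the breaker last blew.
new(MaxFailures, ResetMs) ->
    #{max => MaxFailures,
      reset_ms => ResetMs,
      failures => 0,
      blown_at => undefined}.

%% call(Fun, Now, Breaker) -> {Outcome, NewBreaker}
%%
%% While the breaker is blown and the reset window has not elapsed, the
%% unreliable resource is not touched at all and the caller gets a cheap
%% answer it can turn into a "please try again later" message.
call(_Fun, Now, #{blown_at := BlownAt, reset_ms := ResetMs} = B)
  when is_integer(BlownAt), Now - BlownAt < ResetMs ->
    {{error, unavailable}, B};
%% Otherwise try the call. The exception is caught here so the failure does
%% not travel up the supervision tree; it is only counted.
call(Fun, Now, #{max := Max, failures := Failures} = B) ->
    try Fun() of
        Result ->
            %% Success closes the breaker and clears the failure count.
            {{ok, Result}, B#{failures := 0, blown_at := undefined}}
    catch
        _Class:_Reason ->
            Count = Failures + 1,
            case Count >= Max of
                true  -> {{error, failed}, B#{failures := Count, blown_at := Now}};
                false -> {{error, failed}, B#{failures := Count}}
            end
    end.
```

With new(2, 5000), two failing calls in a row blow the breaker, and further calls within the next five seconds return {error, unavailable} without running the fun, so the web server or user interface can keep answering.  Once the window has passed, the next call is let through as a probe: a success resets the breaker, and a failure blows it again.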

As a footnote, I also created something similar, although it’s not battle tested (I just released it, actually!) and operates at a different level: https://github.com/davidw/hardcore – but it might be useful for some people.
