Circuit Breakers, Hystrix, And Dealing With Failing Back Ends

When you are writing middleware (be it SOAP services, REST APIs, or something else) an important point to realize is that Back Ends Fail. They fail in strange and interesting ways. Code for talking to back ends should always be robust: it should never make a call without some timeout, should always be prepared for the response to be badly formatted, and should always test whether fields are valid before relying on them. (Of course, when one of these things fails it is fine for the middleware service to return an error to the front end calling it; it just isn’t OK for the middleware service to do something like lock up a thread or crash the server.)
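To make those three rules concrete, here is a minimal sketch in Python. Everything specific in it is an assumption made up for illustration: the `fetch_balance` function, the `balance_cents` field, and the `BackEndError` type do not come from any real service. The `opener` parameter exists only so the function can be exercised without a network.

```python
import json
import urllib.request


class BackEndError(Exception):
    """Any back-end failure; the caller turns this into an error response."""


def fetch_balance(url, timeout_seconds=2.0, opener=urllib.request.urlopen):
    """Call a hypothetical back end and return a validated balance in cents."""
    try:
        # Rule 1: never call without a timeout, so a hung back end
        # cannot hold one of our threads forever.
        with opener(url, timeout=timeout_seconds) as response:
            raw = response.read()
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        raise BackEndError(f"back end unreachable: {exc}") from exc

    try:
        # Rule 2: be prepared for a badly formatted response.
        payload = json.loads(raw)
    except ValueError as exc:  # covers invalid JSON and invalid encoding
        raise BackEndError("back end returned malformed JSON") from exc

    # Rule 3: test that the field is valid before relying on it.
    balance = payload.get("balance_cents") if isinstance(payload, dict) else None
    if not isinstance(balance, int) or isinstance(balance, bool):
        raise BackEndError("back end response has no valid balance_cents")
    return balance
```

Every failure path ends in the same `BackEndError`, so the middleware can return a clean error to the front end instead of crashing or hanging.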

But despite all that defensive programming, sometimes back ends will fail in a way that causes errors. After all, the defensive programming was probably not part of your unit tests or test cases, and we all know that untested code sometimes fails. So what happens when the back end goes haywire and somehow starts bringing down your middleware?

Well, what happens is that the production support team springs into action. Situations like this are exactly why we have a team of skilled professionals who carry a beeper and provide 24/7 support for our critical systems. The monitoring that we have recognizes that a problem has occurred, the people involved either recognize the problem (“Oh, look: TSYS is acting up again!”) or they try to rapidly diagnose it (“Quick: check the Oracle connections. It’s affecting all the clusters so there’s a chance that it’s the database.”). Once they know the problem they perform rapid triage (which, honestly, is usually just to take down or reboot the affected servers) and call in the Tier-3 support team to identify the root cause and provide a fix. Often these problems are short-lived or intermittent and after a few hours things start working again.

But could we do better? What if there were a way to partially automate the effort that the production support team makes in this case? We can’t automate the judgement needed to understand the problem and to provide a fix, but we might be able to automate the process of shutting down the offending parts of the system.

That is exactly what the Circuit Breaker Pattern does. This pattern says to wrap any problematic code (like code that talks to a back end) in some code that manages the connection. It will count the number of errors and when that exceeds some threshold it will assume that the back end is misbehaving. Then the circuit breaker STOPS TRYING TO CALL THAT CODE. Instead, all calls will immediately return with an error. The circuit can be restored manually (after the production support team decides that things are stable again) or automatically (allow through 1 attempt every x minutes and restore things if it works), depending on what behavior you desire.
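The pattern is small enough to sketch in a few lines of Python. This is a simplified illustration under my own assumptions, not the API of any particular library: the class and parameter names are invented, it counts consecutive failures (real implementations often use an error rate over a time window), and it is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised at once, without touching the back end, while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, retry_after_seconds=60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.retry_after_seconds = retry_after_seconds
        self._clock = clock          # injectable so tests need not sleep
        self._failures = 0           # consecutive failures seen so far
        self._opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.retry_after_seconds:
                raise CircuitOpenError("circuit open; call skipped")
            # Retry window has elapsed: let this one attempt through as a probe.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            failed_probe = self._opened_at is not None
            if failed_probe or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and reset the count.
        self._failures = 0
        self._opened_at = None
        return result
```

Usage would look like `breaker.call(fetch_account, account_id)`, where `fetch_account` stands in for whatever back-end call you are protecting. Manual restoration would be one more method that resets the two private fields.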

Netflix is a company famous for building software that is rugged and keeps working even in difficult circumstances. These are the folks who invented and deployed Chaos Monkey, an application that literally runs around breaking things in their data center just to keep them on their toes. And they have released a library for implementing the circuit breaker pattern. The Hystrix library is a Java implementation of the pattern and it has a good number of bells and whistles, including a monitoring console.

Within the company where I work, we have been using the Hystrix library for some time now. Since its introduction it has proved to be useful and reliable so we have been expanding its use. I definitely recommend the library for those who want an automated means of recognizing problems and shutting them off quickly (within tens of milliseconds — far faster than any production support person could possibly react) in order to limit the damage done by misbehaving systems.

My Letter to the FCC on Net Neutrality

FCC Logo - a trademark of the FCC. Used here under fair use to illustrate the fact that I am discussing communications with the FCC. The seal does not indicate that anything about this post was endorsed by the FCC.

Today is the deadline for providing public comment to the FCC on whether to classify broadband communications under Title II — basically, whether or not to enforce Net Neutrality in a way that works. The following is the letter that I submitted (by email, since the FCC’s website for posting comments was not functional):

To FCC regulators:

In relation to proceeding 14-28, I would like to express my support for regulating broadband (including wireless internet providers) under Title II (as a “common carrier”) to enable the enforcement of “net neutrality”.

By “net neutrality” I mean the principle that the delivering of internet communications should be independent of which particular entity (application or individual) is transmitting it. In other words, an internet carrier (whether broadband or traditional, wired or wireless) should not be able to block traffic to one of their customers simply because it comes from certain applications. Nor should they be able to degrade service, by offering a different speed or a different error rate for different endpoints their customers might try to reach. Regulation under Title II could achieve this.

One can imagine many ways that such discrimination could be abused. The most egregious would be if a major provider like Comcast were to interfere in the democratic process by blocking (or just degrading) access to the donation pages for certain politicians, though I think we can all agree that this is unlikely. Subtler in effect but similar in kind would be for a major player like Verizon Wireless to provide greater bandwidth (thus greater speed) for one company (Amazon, perhaps) rather than another (Walmart). In this hypothetical situation Verizon might engage in this behavior because of direct kickbacks (payments) from Amazon, or as a threat to persuade Walmart to sign an unrelated contract with them. This hypothetical is far more plausible, but would be, in many ways, just as harmful. As a final example, some major internet providers such as Comcast or Verizon might intentionally manage their network so as to reduce bandwidth from certain sites their customers connect to (Netflix, perhaps) in order to demand that this third party (not the ISP’s customer) pay them additional amounts. This example is no mere hypothetical: it has HAPPENED ALREADY.

Perhaps none of this would be necessary if there were hundreds of small providers of broadband internet service giving each customer the choice of 5, 10, or more providers. In such a situation market forces might allow consumers to select from providers and choose those that did not degrade service for their favorite destinations. But that is not the world that we live in. The FCC’s own measurements (December 2013) show that two thirds of customers had access to 2 or fewer wireline broadband providers, and over a quarter had only a single provider. Providing network connectivity is a natural monopoly because of the cost of placing wires (or wireless stations) and the network effects of doing so heavily in a certain location.

Imagine if, 15 years ago, internet providers had charged a “reasonable” fee for transmitting video over broadband. This would have been eminently reasonable, given that most broadband providers at that time were (and still are) in the business of selling such a service (“cable television”). Imagine that their “reasonable” rates were just 1 tenth the consumer cost of their own offerings (1 tenth the cost of a normal customer’s cable bill) — that would have seemed quite reasonable to any regulator. But under such an environment, YouTube could never have begun. YouTube introduced a completely new business model — no one had ever offered free hosting and viewing of small customer-created clips of video. Beforehand, no one could have known whether that model would have succeeded or failed, but without net neutrality it could never even have been tried. And without YouTube, we would not have things like Khan Academy and hundreds of other projects to provide training videos on every subject.

YouTube is not the last great invention; there will be new innovators in the future who will create new industries that spur our economy and benefit all of society. And although I do not know what these innovations will be, I can say with confidence that these new industries will make use of the internet. But they will only be able to do so if you impose regulations now that enforce a policy of net neutrality; if you do not, then the next YouTube will simply not occur.

Please take this advice into consideration in your rulemaking.

Michael Chermside
2936 Morris Rd
Ardmore, PA 19003


What’s the “right” way to abandon an open source package?

In the Python discussion group, Skip Montanaro posted the title question: what’s the “right” way to abandon an open source package?

He got one detailed and helpful answer from Ben Finney.

It was an excellent question and an excellent response. I thought it was worth sharing here.

Using the Legal System To Access Customer Data

A week or so ago Microsoft dug into a customer’s Hotmail account in order to track down some information about code that had been stolen from Microsoft. Their terms and conditions specifically allowed them to do this, but despite that they received a fair amount of criticism.

With this announcement they have decided to change their policy. Now they will only access private customer data in response to a law enforcement request — if a similar situation arises, they will ask law enforcement to investigate (by asking them to provide the data).

This is an excellent decision. Our legal system may not be perfect, but it has all kinds of checks and balances built in to prevent abuses and to balance individuals’ rights against the public need to perform investigations. Rather than inventing their own “legal system” for adjudicating such things fairly, Microsoft is taking advantage of the existing system our society has built up over centuries. Other “cloud providers” (for that is exactly what web mail is) should adopt the same policy.

Reasons Why My Code Style is Wrong

I have had people tell me things like “You should never throw an exception to return a value in an unusual case, exceptions are only supposed to be used for error conditions.” And I HATE it when people say things like this.

There are a couple of audiences at play when we write code. There is the machine that needs to compile and then execute the code. If the code won’t compile or if it runs slowly, wastes memory, or produces incorrect results then we have absolutely failed. The other audience is future readers of the code — other developers who will need to read and maintain the code or even ourselves who will come back months later and say “Oh my God, what was I thinking!” when we read it.

But communicating with these two audiences (the computer and the reader) is all we are doing. We are not playing a game by some arbitrary set of rules. There are no “software police” who will come along and arrest us if we throw an exception, use a goto, fail to start our service name with a verb, or use a URI for our REST API that is inconsistent with some other part of the ontology. (Actually, there may be software police, but if so it is only because they have appointed themselves to that role.) Doing these things might make the results incorrect (for the computer) or difficult to read (for the developer) and that would be bad, but only because it was incorrect or difficult to understand, not because it broke some arbitrary rule.

So I would much rather hear someone say one of these things to me:

  • “Don’t throw an exception to return a value in this case because creating exceptions is slow, the situation occurs somewhat frequently, and it will reduce the performance of this function which is used in a tight loop and is therefore performance critical.”
  • “Don’t throw an exception to return a value in this case because there is error-handling code that will log the exception as an error before it is caught by the handler, thus producing spurious error reports.”
  • “Don’t throw an exception to return a value in this case because it will hide the fact that the function actually returns a value and that will be confusing to someone not familiar with it.”
  • “Don’t throw an exception to return a value because the other code in this module never does that and it will be inconsistent and therefore difficult for readers of the code.”

In other words: It’s great that you’re telling me a different (or better) way to write some code, but don’t tell me to do it because that’s “the right way” (appeal to authority). Instead, tie it back to an actual benefit like correctness or readability.
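For anyone who has not seen the technique being argued about, here is a small Python sketch of both styles side by side. The tree search and all the names in it are invented for illustration; the point is only that the two versions return the same answers, so the choice between them has to rest on speed, logging, or readability.

```python
class Found(Exception):
    """Carries a result up the stack; this is not an error condition."""

    def __init__(self, value):
        super().__init__(value)
        self.value = value


def _search(node, wanted):
    # Hypothetical tree: nested lists with strings at the leaves.
    if isinstance(node, list):
        for child in node:
            _search(child, wanted)
    elif wanted in node:
        raise Found(node)  # unwinds every level of recursion in one step


def find_with_exception(tree, wanted):
    """Return the first leaf containing `wanted`, using an exception to return."""
    try:
        _search(tree, wanted)
    except Found as found:
        return found.value
    return None


def find_with_return(tree, wanted):
    """Same result, but each level passes the hit back up explicitly."""
    if isinstance(tree, list):
        for child in tree:
            hit = find_with_return(child, wanted)
            if hit is not None:
                return hit
        return None
    return tree if wanted in tree else None
```

The exception version has a shorter inner loop; the return version makes it obvious that the function produces a value. Which matters more depends on the reasons in the list above, not on a rule.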

Book Review: Learning jQuery Deferreds

I don’t often write book reviews here, but in this case I have a connection to the book. My friend Terry Jones was one of the authors of a new O’Reilly programming book (you know, the ones with the animal pictures on the covers) which is titled:

Learning jQuery Deferreds
Taming Callback Hell with Deferreds and Promises

I offered to read and provide feedback on a pre-print version of the book (the publishing process all happens on PDFs these days) and I can say it was a great read. In fact, I have the following review of the book:

Concurrent or parallel programming is hard – REALLY hard. Like quantum mechanics, it is one of the few areas where the mark of a true expert is that they admit to NOT clearly understanding the subject.

The “deferred” is an object pattern for handling one piece of the complexity of concurrent code. It helps to bridge the gap between writing things in a linear format as if for a single-threaded computer, and writing a series of triggers that go off when events occur. Those two models are not really compatible, and that can make it quite confusing.
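To give a flavor of the pattern, here is a toy deferred sketched in Python rather than JavaScript. It borrows the method names `done` and `resolve` from jQuery, but it is my own simplification and not how jQuery implements them: there is no failure path, no chaining of results, and no thread safety.

```python
class Deferred:
    """A toy deferred: collects callbacks now, runs them when a value arrives."""

    def __init__(self):
        self._callbacks = []
        self._resolved = False
        self._value = None

    def done(self, callback):
        # If the value is already here, fire immediately; otherwise remember
        # the callback until resolve() is called.
        if self._resolved:
            callback(self._value)
        else:
            self._callbacks.append(callback)
        return self  # returning self allows d.done(a).done(b)

    def resolve(self, value):
        if self._resolved:
            return  # a deferred settles only once; later calls are ignored
        self._resolved = True
        self._value = value
        for callback in self._callbacks:
            callback(value)
        self._callbacks.clear()
```

The caller writes `d.done(handle_result)` in linear style, and it does not matter whether the event that calls `d.resolve(...)` has already happened or is still to come. That indifference to ordering is the gap the pattern bridges.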

The jQuery library offers a “deferred” object which is deceptively simple: just a handful of methods. It could all be explained completely in about a page of text (and IS explained that way if you read the docs). But no one who was not already an expert in the use of the “deferred” pattern could possibly use it correctly.

And that is where this book comes in. The text slowly explains what the class offers and how it functions, along with the reasons why each design detail is important. And then it presents a series of exercises of increasing complexity — all reasonable real-world examples, by the end of which you will fully understand how concurrency can be tamed (partly) with the deferred class. I am a reasonably skilled programmer (20 years experience, at least 15 with concurrent programming) and I found the pace to be about right: everything explained VERY clearly with examples (which is exactly what you want for a tricky subject no matter HOW well you know it).

If you’ve been using jQuery deferreds for a couple of years now you should probably skip this book — by this point you may be an expert. But for everyone else who thinks they might be using them, this is a great little tutorial and I recommend it highly.


CAPTCHAs are those odd little boxes that show some badly malformed letters and numbers and ask you to type them in. The idea is to check whether you are a human.

The problem is that CAPTCHAs are pretty difficult for humans. And they’re fairly easy for computers. There are the simple work-arounds (like paying to break CAPTCHAs on Mechanical Turk). And there are the high-tech solutions where you simply build a computer that can solve them. My biggest concern though is the new kind of CAPTCHA that people have begun using. I find it to be a real problem, and it, too, can be worked around by anyone who is sufficiently motivated, but it is becoming a disturbingly common new way of identifying real humans:

Log In With Facebook


Version Control… for Servers

I wanted to pass on an excellent idea that I read on Martin Fowler’s blog. He calls it Immutable Servers, but I claim, if you think about it properly, it is merely the application of version control to systems administration.

Everyone understands just how much version control has transformed the development of software code. It enables developers to make changes freely, rolling back changes if they need to. It enables them to look back in history and find out how things stood at any point in time, what was changed on a certain date, or when a given change was introduced. And with advanced usage, it allows “branching”, where one can experiment with a group of changes for a long time (while still working on the original branch as well) then merge them together later.

These features aren’t just for code. They are great for text documents that get edited frequently. They are a great idea for file systems. And system administrators are familiar with the idea of keeping all of their system administration scripts in a version control system. But some things are extremely difficult to put under version control. Databases are notoriously difficult to version (although Capital One 360 manages it). And servers, being pieces of physical hardware, are impossible to check into Git.

Except that they’re not. Servers are not pieces of physical hardware anymore… they were until the last decade, but in recent years that has changed. The vast majority of the servers in our data center either are run or can be run on virtual servers. The current buzzword is “cloud computing”, but whatever you call it, we have the technology to spin up and deploy servers from a template in a matter of minutes. (The fact that it takes weeks to get a server set up for your project has nothing to do with technical problems… that’s just our own failure to take full advantage of the technology that we own.)

So, given that the servers are probably running on a virtual machine anyway, it’s a good idea to keep a virtual machine template with the correct configuration (for quickly restoring the machine). Of course, if you do this you will need to update the template every time you make a significant configuration change. Updating the image doesn’t necessarily mean you launch a virtual machine each time, make the change, then save a new image — you can use tools like Puppet or Chef as part of the image deployment process so often it is just a matter of editing a configuration file.

For the final step, Martin Fowler proposes that you take this to its logical conclusion. If every change needs to be made on the real server AND on the template, why not simplify your workflow (and make it more reliable at the same time) by making the changes directly to the image and deploying a new copy each time? You never change the production server, just roll out a new one each time. This sounds crazy to anyone who hasn’t yet drunk the “cloud computing” Kool-Aid, or to anyone for whom creating a new instance of a server takes more than a couple of minutes, but if you DO have an environment that flexible, then you might get all the benefits of version control but for servers. Netflix is one example of a major company that has taken this approach quite successfully.


I dream of Satoshi Nakamoto

“Satoshi Nakamoto” is the alias of the anonymous person who invented and published the protocol for Bitcoin. So far, no one knows for sure who it is, although attempts have been made to unmask the person (or people) by an analysis of their writing style and similar indicators. Now, in a blog post, Sergio Demian Lerner has found a way to recognize coins mined by the same computer and has picked out the distinctive pattern of a certain individual who began mining almost from block one and continued mining at a consistent rate with regular restarts for a long time, without spending any of those coins.

This, he says, is Satoshi, and I applaud Sergio for this clever way to recognize an individual miner. Like Sergio, I am pleased that Satoshi’s fortune in Bitcoins is now apparently worth around $100 million USD. But Sergio also suggests that he expects this will lead to the unmasking of Satoshi once others track this to a Bitcoin somewhere which HAS been spent. (Bitcoin has many advantages, but it is NOT fully anonymous: in fact, anyone can track a payment back to see which (anonymous) account it came from previously.)

I hope he is wrong about the unmasking. I prefer to imagine that Satoshi Nakamoto is living and working a normal job, still haunting cryptography boards in the evenings and on weekends, and occasionally checking the news to see how that Bitcoin thing is progressing. I imagine that someday, many years from now, when she dies her husband will open that envelope she left in the safe-deposit-box and it will contain a hard drive and stack of papers labeled “Now that I am gone, please publish this for the world to read.”

Okay, it’s just a romantic dream, but I’m hanging onto it as long as I can.

How NOT to do technical recruiting: Sunil Kumar of Panzer Solutions

So, “Sunil Kumar” of Panzer Solutions wrote to me ten days ago offering a position. Normally, I appreciate hearing from recruiters. As it happens, I have no interest in a new job; I am happy with my current position and have plenty of new challenges there recently. But it is nice to hear the signs that my industry is doing well, and keeping up contacts with recruiters in my area is a good idea.

But Mr. Kumar didn’t write me about a position commensurate with my specific skills, he wrote to tell me “We have more than 100 W2 working currently with successful hit.” (That’s not quite English, but it’s fairly close.) There are recruiters who work hard to match up a particular applicant with a position where their skills and their career/environment preferences are a good fit. When I am doing the hiring (and just to note, Capital One is hiring right now in the Wilmington area), I love working with these recruiters: they bring me just 3 resumes and I end up wanting to bring in all 3 for further interviews. That’s a much more pleasant experience than digging through a stack of resumes from applicants, most of whom can’t pass the FizzBuzz test.

Mr. Kumar is in a different category altogether: he clearly thinks recruiting is a numbers game: if he just sends enough applicant names to enough open positions then he’ll be successful. He won’t be, because he’s not adding value. So I politely wrote back to Mr. Kumar explaining this and asking that he not send me “blind mailing” style job offers. A week later I have received TWO other emails from Mr. Kumar stating that “Panzer Solutions is looking to hire 10-20 New H1b’s and OPT EAD’s in coming one month.” (Still, not quite English.) Besides being a violation of federal employment law (I’m not a lawyer, but I was under the impression that companies were not permitted to favor H1B holders over citizens), this is no better than spam, either for the recipient (me) or the employer to whom the names are offered.

So I am Naming and Shaming Mr. Sunil Kumar of Panzer Solutions, and I will never do business with him or his company. Here’s hoping this article jumps to the top of the search rankings for those names so that others will recognize their uselessness sooner and Panzer and Mr. Kumar can quickly go out of business and leave space for better recruiters who actually make the hiring process easier, not harder.