Circuit Breakers, Hystrix, And Dealing With Failing Back Ends

Filed Under Programming, Technology

When you are writing middleware (be it SOAP services, REST APIs, or something else) an important point to realize is that Back Ends Fail. They fail in strange and interesting ways. Code for talking to back ends should always be robust: it should never make a call without some timeout, should always be prepared for the response to be badly formatted, and should always test whether fields are valid before relying on them. (Of course, when one of these things fails it is fine for the middleware service to return an error to the front end calling it; it just isn’t OK for the middleware service to do something like lock up a thread or crash the server.)
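To make those three rules concrete, here is a minimal sketch in Java. Everything in it is an assumption made up for illustration: the `DefensiveClient` and `BackendException` names, the `balanceCents` field, and the idea that the back end hands back a simple map of strings. It shows a call that never waits longer than a timeout and that checks the response before trusting it.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical error type: the one kind of failure the middleware reports to its caller.
class BackendException extends Exception {
    BackendException(String message, Throwable cause) { super(message, cause); }
}

class DefensiveClient {
    // backendCall stands in for whatever really talks to the back end (an assumption).
    static long readBalanceCents(Supplier<Map<String, String>> backendCall, long timeoutMillis)
            throws BackendException {
        CompletableFuture<Map<String, String>> pending = CompletableFuture.supplyAsync(backendCall);
        Map<String, String> response;
        try {
            // Rule 1: never wait without a timeout.
            response = pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // This only stops the caller from waiting; the worker thread is not interrupted,
            // so a real client should also set timeouts on the underlying connection.
            pending.cancel(true);
            throw new BackendException("back end timed out", e);
        } catch (ExecutionException e) {
            throw new BackendException("back end call failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new BackendException("interrupted while waiting for back end", e);
        }
        // Rule 2: be prepared for a badly formatted response.
        if (response == null) {
            throw new BackendException("empty response", null);
        }
        String raw = response.get("balanceCents");
        if (raw == null) {
            throw new BackendException("missing field balanceCents", null);
        }
        // Rule 3: test that the field is valid before relying on it.
        try {
            return Long.parseLong(raw.trim());
        } catch (NumberFormatException e) {
            throw new BackendException("balanceCents is not a number: " + raw, e);
        }
    }
}
```

Every failure path ends in a `BackendException` that the service can turn into an error response, and none of them leaves the calling thread blocked.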

But despite all that defensive programming, sometimes back ends will fail in a way that causes errors. After all, the defensive programming was probably not part of your unit tests or test cases, and we all know that untested code sometimes fails. So what happens when the back end goes haywire and somehow starts bringing down your middleware?

Well, what happens is that the production support team springs into action. Situations like this are exactly why we have a team of skilled professionals who carry a beeper and provide 24/7 support for our critical systems. The monitoring that we have recognizes that a problem has occurred, the people involved either recognize the problem (“Oh, look: TSYS is acting up again!”) or they try to rapidly diagnose it (“Quick: check the Oracle connections. It’s affecting all the clusters so there’s a chance that it’s the database.”). Once they know the problem they perform rapid triage (which, honestly, is usually just to take down or reboot the affected servers) and call in the Tier-3 support team to identify the root cause and provide a fix. Often these problems are short-lived or intermittent and after a few hours things start working again.

But could we do better? What if there were a way to partially automate the effort that the production support team makes in this case? We can’t automate the judgement needed to understand the problem and to provide a fix, but we might be able to automate the process of shutting down the offending parts of the system.

That is exactly what the Circuit Breaker Pattern does. This pattern says to wrap any problematic code (like code that talks to a back end) in some code that manages the connection. It will count the number of errors and when that exceeds some threshold it will assume that the back end is misbehaving. Then the circuit breaker STOPS TRYING TO CALL THAT CODE. Instead, all calls will immediately return with an error. The circuit can be restored manually (after the production support team decides that things are stable again) or automatically (allow through 1 attempt every x minutes and restore things if it works), depending on what behavior you desire.
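For the curious, the core of the pattern fits in a few dozen lines. The sketch below is my own illustration and is NOT how Hystrix is implemented: the `CircuitBreaker` class, its parameters, the choice to count consecutive failures, and the injected clock (there only so the behavior can be exercised without waiting) are all assumptions.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Illustrative circuit breaker; names and policy are assumptions, not the Hystrix API.
class CircuitBreaker {
    private final int failureThreshold;   // consecutive failures that trip the breaker
    private final long retryAfterMillis;  // wait this long before allowing one trial call
    private final LongSupplier clock;     // source of "now" in milliseconds
    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAt = 0;

    CircuitBreaker(int failureThreshold, long retryAfterMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.retryAfterMillis = retryAfterMillis;
        this.clock = clock;
    }

    synchronized boolean isOpen() { return open; }

    // Manual restore, for when the support team decides things are stable again.
    synchronized void reset() { recordSuccess(); }

    <T> T call(Supplier<T> backend) {
        synchronized (this) {
            // While open and still inside the retry window, fail fast without calling the back end.
            if (open && clock.getAsLong() - openedAt < retryAfterMillis) {
                throw new IllegalStateException("circuit open: back end not called");
            }
        }
        try {
            T result = backend.get();  // normal call, or a trial call after the window
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure();
            throw e;
        }
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
        open = false;
    }

    private synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            open = true;
            openedAt = clock.getAsLong();  // a failed trial call restarts the window
        }
    }
}
```

In production code you would construct it with something like `new CircuitBreaker(5, 60_000, System::currentTimeMillis)`. Note that this sketch lets every thread that arrives after the retry window make a trial call; a real implementation would normally limit that to a single attempt.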

Netflix is a company famous for building software that is rugged and keeps working even in difficult circumstances. These are the folks who invented and deployed Chaos Monkey, an application that literally runs around breaking things in their data center just to keep them on their toes. And they have released a library for implementing the circuit breaker pattern. The Hystrix library is a Java implementation of the pattern and it has a good number of bells and whistles like the console you can see in the image to the right.

Within the company where I work, we have been using the Hystrix library for some time now. Since its introduction it has proved to be useful and reliable so we have been expanding its use. I definitely recommend the library for those who want an automated means of recognizing problems and shutting them off quickly (within tens of milliseconds — far faster than any production support person could possibly react) in order to limit the damage done by misbehaving systems.

What’s the “right” way to abandon an open source package?

Filed Under Programming

In the Python discussion group, Skip Montanaro posted the title question: what’s the “right” way to abandon an open source package?

He got one detailed and helpful answer from Ben Finney.

It was an excellent question and an excellent response. I thought it was worth sharing here.

Reasons Why My Code Style is Wrong

Filed Under Programming

I have had people tell me things like “You should never throw an exception to return a value in an unusual case, exceptions are only supposed to be used for error conditions.” And I HATE it when people say things like this.

There are a couple of audiences at play when we write code. There is the machine that needs to compile and then execute the code. If the code won’t compile or if it runs slowly, wastes memory, or produces incorrect results then we have absolutely failed. The other audience is future readers of the code — other developers who will need to read and maintain the code or even ourselves who will come back months later and say “Oh my God, what was I thinking!” when we read it.

But communicating with these two audiences (the computer and the reader) is all we are doing — we are not playing a game by some arbitrary set of rules. There are no “software police” who will come along and arrest us if we throw an exception, use a goto, fail to start our service name with a verb, or use a URI for our REST API that is inconsistent with some other part of the ontology. (Actually, there may be software police, but if so it is only because they have chosen to self-appoint themselves to that role.) Doing these things might make the results incorrect (for the computer) or difficult to read (for the developer) and that would be bad, but only because it was incorrect or difficult-to-understand, not because it broke some arbitrary rule.

So I would much rather hear someone say one of these things to me:

  • “Don’t throw an exception to return a value in this case because creating exceptions is slow, the situation occurs somewhat frequently, and it will reduce the performance of this function which is used in a tight loop and is therefore performance critical.”
  • “Don’t throw an exception to return a value in this case because there is error-handling code that will log the exception as an error before it is caught by the handler, thus producing spurious error reports.”
  • “Don’t throw an exception to return a value in this case because it will hide the fact that the function actually returns a value and that will be confusing to someone not familiar with it.”
  • “Don’t throw an exception to return a value because the other code in this module never does that and it will be inconsistent and therefore difficult for readers of the code.”

In other words: It’s great that you’re telling me a different (or better) way to write some code, but don’t tell me to do it because that’s “the right way” (appeal to authority). Instead, tie it back to an actual benefit like correctness or readability.

Book Review: Learning jQuery Deferreds

Filed Under Programming, Reviews, Technology

I don’t often write book reviews here, but in this case I have a connection to the book. My friend Terry Jones was one of the authors of a new O’Reilly programming book (you know, the ones with the animal pictures on the covers) which is titled:

Learning jQuery Deferreds
Taming Callback Hell with Deferreds and Promises

I offered to read and provide feedback on a pre-print version of the book (the publishing process all happens on PDFs these days) and I can say it was a great read. In fact, I have the following review of the book:

Concurrent or parallel programming is hard – REALLY hard. Like quantum mechanics, it is one of the few areas where the mark of a true expert is that they admit to NOT clearly understanding the subject.

The “deferred” is an object pattern for handling one piece of the complexity of concurrent code. It helps to bridge the gap between writing things in a linear format as if for a single-threaded computer, and writing a series of triggers that go off when events occur. Those two models are not really compatible, and that can make it quite confusing.
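For readers who have not met the pattern, here is a small sketch of the idea. It uses Java's `CompletableFuture` as a stand-in, which is only an analog and not the jQuery API the book covers; the `DeferredSketch` class is made up for illustration. The callbacks are written in linear order first, and they fire later when the value arrives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Illustration only: CompletableFuture plays the role of a "deferred" here.
class DeferredSketch {
    static List<String> demo() {
        List<String> log = new ArrayList<>();

        // An unresolved placeholder for a value that will arrive later.
        CompletableFuture<Integer> deferred = new CompletableFuture<>();

        // Register the steps in reading order, before any value exists.
        deferred.thenApply(n -> n * 2)
                .thenAccept(n -> log.add("got " + n));
        log.add("callbacks registered");

        // Some event eventually supplies the value, which triggers the chain.
        deferred.complete(21);
        log.add("resolved");
        return log;
    }
}
```

Calling `DeferredSketch.demo()` on a single thread returns the entries "callbacks registered", "got 42", "resolved" in that order: the callback text appears first in the source but runs only once the value is supplied.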

The jQuery library offers a “deferred” object which is deceptively simple: just a handful of methods. It could all be explained completely in about a page of text (and IS explained that way if you read the docs). But no one who was not already an expert in the use of the “deferred” pattern could possibly use it correctly.

And that is where this book comes in. The text slowly explains what the class offers and how it functions, along with the reasons why each design detail is important. And then it presents a series of exercises of increasing complexity — all reasonable real-world examples, by the end of which you will fully understand how concurrency can be tamed (partly) with the deferred class. I am a reasonably skilled programmer (20 years experience, at least 15 with concurrent programming) and I found the pace to be about right: everything explained VERY clearly with examples (which is exactly what you want for a tricky subject no matter HOW well you know it).

If you’ve been using jQuery deferreds for a couple of years now you should probably skip this book — by this point you may be an expert. But for everyone else who thinks they might be using them, this is a great little tutorial and I recommend it highly.

I dream of Satoshi Nakamoto

Filed Under Programming, Security

“Satoshi Nakamoto” is the alias of the anonymous person who invented and published the protocol for Bitcoin. So far, no one knows for sure who it is, although attempts have been made to unmask the person (or people) by an analysis of their writing style and similar indicators. Now, in a blog post, Sergio Demian Lerner has found a way to recognize coins mined by the same computer and has picked out the distinctive pattern of a certain individual who began mining almost from block one and continued mining at a consistent rate with regular restarts for a long time, without spending any of those coins.

This, he says, is Satoshi, and I applaud Sergio for this clever way to recognize an individual miner. Like Sergio, I am pleased that Satoshi’s fortune in Bitcoins is now apparently worth around $100 million USD. But Sergio also suggests that he expects this will lead to the unmasking of Satoshi once others track this to a Bitcoin somewhere which HAS been spent. (Bitcoin has many advantages, but it is NOT fully anonymous: in fact, anyone can track a payment back to see which (anonymous) account it came from previously.)

I hope he is wrong about the unmasking. I prefer to imagine that Satoshi Nakamoto is living and working a normal job, still haunting cryptography boards in the evenings and on weekends, and occasionally checking the news to see how that Bitcoin thing is progressing. I imagine that someday, many years from now, when she dies her husband will open that envelope she left in the safe-deposit-box and it will contain a hard drive and stack of papers labeled “Now that I am gone, please publish this for the world to read.”

Okay, it’s just a romantic dream, but I’m hanging onto it as long as I can.

Story Points Aren’t Accurate – That’s Why They’re Good

Filed Under Programming

Ben Northrop wrote to complain that story points are not accurate. They don’t (always) map linearly to hours spent, so adding up story points over a large project won’t accurately give hours for the project. In the spirit of expressing controversial opinions, I will agree, and explain why I think that’s a good thing.

I believe that story points serve as a “rough” estimate. In the teams I work with, story point estimates are made quickly (a few minutes to be sure we understand the story, then quickly discuss and reach a consensus estimate). They are quantized (must round off to some Fibonacci number) which means that any given estimate is necessarily imperfect.
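To see how much information the quantizing throws away, here is a tiny sketch. The scale (1 through 21) and the nearest-value rule with ties going to the smaller number are my assumptions for illustration; teams differ, and many simply round up.

```java
// Illustrative quantizer; the scale and the rounding rule are assumptions.
class StoryPoints {
    static final int[] SCALE = {1, 2, 3, 5, 8, 13, 21};

    // Snap a raw "gut feel" size onto the nearest value in the scale.
    static int quantize(double raw) {
        int best = SCALE[0];
        for (int p : SCALE) {
            // Strict comparison, so a tie keeps the smaller value found first.
            if (Math.abs(p - raw) < Math.abs(best - raw)) {
                best = p;
            }
        }
        return best;
    }
}
```

With this rule `quantize(6)` gives 5 and `quantize(10)` gives 8, so two stories that "felt like" 16 in total are recorded as 13. Each individual estimate is off by design, which is why the sums should be read as rough.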

As such, they provide a cheap (didn’t take long to generate) but rough (not perfectly accurate) estimate, and they have to be respected as such. Story point estimates would not be useful to answer questions like “Will this project deliver in October or November?”, but they ARE useful for questions like “Would this be a 3-month project or a 1 year project?” For some purposes, a more precise estimate is needed, and then it may be necessary to invest a few hours to a few weeks to perform detailed work to generate a more precise estimate. However, I think that such situations are rare: people *want* perfect estimates ahead of time but rarely *need* them. Also I think that people are usually fooling themselves: most (usually waterfall) projects with precise up-front estimates later discover that those estimates are not accurate.

One of the strengths of story points is that everyone (including the customer) REALIZES that they are rough and don’t correspond to a precise delivery date — something that can be difficult to explain for estimates expressed in hours.

Constant Crawl Design – Part 4

Filed Under Programming

Suppose you wanted to build a tool for anonymously capturing the websites that a user visited and keeping a record of the public sites while keeping the users completely anonymous so their browsing history could not be determined. One of the most difficult challenges would be finding a way to decide whether a site was “public” and to do so without keeping any record (not even on the user’s own machine) of the sites visited or even tying together the different sites by one ID (even an anonymous one).

Constant Crawl Design – Part 3

Filed Under Programming

Suppose you were building a tool integrated with web browsers to anonymously capture the (public) websites that a user visited and store them to a P2P network shared by the users of this tool. What would the requirements be for this storage P2P network?

Constant Crawl Design – Part 2

Filed Under Programming

Suppose you were building a tool for anonymously capturing the (public) websites that a user visited. What would the UI requirements be?

Constant Crawl Design – Part 1

Filed Under Programming

Do you remember Google Web Accelerator? The idea was that you downloaded all your pages through Google’s servers. For content that was static, Google could just load it once, then cache it and serve up the same page to every user. The advantage to the user was that they got the page faster, and more reliably; the advantage to Google was that they got to crawl the web “as the user sees it” instead of just what Googlebot gets… and that they got to see every single page you viewed, thus feeding even more into the giant maw of information that is Google.

Well, Google eventually dropped Google Web Accelerator (I wonder why?), but the idea is interesting. Suppose you wanted to build a similar tool that would capture the web viewing experience of thousands of users (or more). For users it could provide a reliable source for sites that go down or that get hit with the “slashdot” effect. For the Internet Archive or a smaller search engine like Duck Duck Go, it would provide a means of performing a massive web crawl. For someone like the EFF or human-rights groups it would provide a way to monitor whether some users (such as those in China) are being “secretly” served different content. But unlike Google Web Accelerator, a community-driven project would have to solve one very hard problem: how to do this while keeping the user’s browsing history secret, which is the exact opposite of what Google’s project did.