Circuit Breakers, Hystrix, And Dealing With Failing Back Ends

Filed Under Programming, Technology

When you are writing middleware (be it SOAP services, REST APIs, or something else) an important point to realize is that Back Ends Fail. They fail in strange and interesting ways. Code for talking to back ends should always be robust: it should never make a call without some timeout, should always be prepared for the response to be badly formatted, and should always test whether fields are valid before relying on them. (Of course, when one of these things fails it is fine for the middleware service to return an error to the front-end calling it — it just isn’t OK for the middleware service to do something like lock up a thread or crash the server.)

But despite all that defensive programming, sometimes back ends will fail in a way that causes errors. After all, the defensive programming was probably not part of your unit tests or test cases, and we all know that untested code sometimes fails. So what happens when the back end goes haywire and somehow starts bringing down your middleware?

Well, what happens is that the production support team springs into action. Situations like this are exactly why we have a team of skilled professionals who carry a beeper and provide 24/7 support for our critical systems. Our monitoring recognizes that a problem has occurred, and the people involved either recognize the problem (“Oh, look: TSYS is acting up again!”) or try to rapidly diagnose it (“Quick: check the Oracle connections. It’s affecting all the clusters so there’s a chance that it’s the database.”). Once they know the problem they perform rapid triage (which, honestly, usually just means taking down or rebooting the affected servers) and call in the Tier-3 support team to identify the root cause and provide a fix. Often these problems are short-lived or intermittent and after a few hours things start working again.

But could we do better? What if there were a way to partially automate the effort that the production support team makes in this case? We can’t automate the judgement needed to understand the problem and to provide a fix, but we might be able to automate the process of shutting down the offending parts of the system.

That is exactly what the Circuit Breaker Pattern does. This pattern says to wrap any problematic code (like code that talks to a back end) in some code that manages the connection. The wrapper counts the number of errors, and when that count exceeds some threshold it assumes that the back end is misbehaving. Then the circuit breaker STOPS TRYING TO CALL THAT CODE. Instead, all calls immediately return with an error. The circuit can be restored manually (after the production support team decides that things are stable again) or automatically (allow through one attempt every X minutes and restore things if that attempt works), depending on what behavior you desire.
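Here is a minimal sketch in Java of what such a wrapper might look like. It is illustrative only: the class name, threshold, and retry interval are my own inventions, not taken from any particular library.

```java
import java.util.function.Supplier;

// A deliberately simple circuit breaker: after `failureThreshold` consecutive
// failures it "opens" and fails fast, then allows one trial call per retry
// interval to see whether the back end has recovered.
public class CircuitBreaker<T> {
    private final int failureThreshold;
    private final long retryIntervalMillis;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    public CircuitBreaker(int failureThreshold, long retryIntervalMillis) {
        this.failureThreshold = failureThreshold;
        this.retryIntervalMillis = retryIntervalMillis;
    }

    public synchronized T call(Supplier<T> backEndCall) {
        boolean open = consecutiveFailures >= failureThreshold;
        boolean retryWindowReached =
                open && System.currentTimeMillis() - openedAt >= retryIntervalMillis;
        if (open && !retryWindowReached) {
            // Circuit is open: fail fast instead of calling the troubled back end.
            throw new IllegalStateException("Circuit open; back end presumed down");
        }
        try {
            T result = backEndCall.get(); // the one trial call when the window is open
            consecutiveFailures = 0;      // a success closes the circuit again
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = System.currentTimeMillis(); // (re)trip the circuit
            }
            throw e;
        }
    }
}
```

A service would typically hold one breaker per back end and route every call through it, something like `accountBreaker.call(() -> client.fetchAccount(id))`, with the threshold and retry interval tuned to how that back end tends to fail.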

Netflix is a company famous for building software that is rugged and keeps working even in difficult circumstances. These are the folks who invented and deployed Chaos Monkey, an application that literally runs around breaking things in their data center just to keep them on their toes. And they have released a library for implementing the circuit breaker pattern. The Hystrix library is a Java implementation of the pattern, and it has a good number of bells and whistles like the console you can see in the image to the right.

[Image: sample view of the Hystrix console]
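In Hystrix the unit of wrapping is a command class. As a rough illustration (the command name, group name, and back-end call below are placeholders of my own, not anything from the library’s documentation), a command might look like this:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class AccountLookupCommand extends HystrixCommand<String> {
    private final String accountId;

    public AccountLookupCommand(String accountId) {
        // Commands in the same group share a thread pool and circuit-breaker state.
        super(HystrixCommandGroupKey.Factory.asKey("AccountBackEnd"));
        this.accountId = accountId;
    }

    @Override
    protected String run() throws Exception {
        // The actual back-end call; callBackEnd() stands in for whatever client
        // code performs the request.
        return callBackEnd(accountId);
    }

    @Override
    protected String getFallback() {
        // Returned immediately, without touching the back end, when the circuit
        // is open, the call times out, or run() throws.
        return "UNAVAILABLE";
    }

    private String callBackEnd(String accountId) throws Exception {
        throw new UnsupportedOperationException("replace with a real back-end client");
    }
}
```

Callers then run `new AccountLookupCommand("12345").execute()` for a synchronous call (or `queue()` for an asynchronous one), and Hystrix handles the timeouts, the error counting, and the opening and closing of the circuit.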

Within the company where I work, we have been using the Hystrix library for some time now. Since its introduction it has proved to be useful and reliable so we have been expanding its use. I definitely recommend the library for those who want an automated means of recognizing problems and shutting them off quickly (within tens of milliseconds — far faster than any production support person could possibly react) in order to limit the damage done by misbehaving systems.

My Letter to the FCC on Net Neutrality

Filed Under Politics, Technology

[Image: the FCC logo, a trademark of the FCC, used here under fair use to illustrate a post discussing communications with the FCC. The seal does not indicate that anything about this post was endorsed by the FCC.]

Today is the deadline for providing public comment to the FCC on whether to classify broadband communications under Title II — basically, whether or not to enforce Net Neutrality in a way that works. The following is the letter that I submitted (by email, since the FCC’s website for posting comments was not functional):

To FCC regulators:

In relation to proceeding 14-28, I would like to express my support for regulating broadband (including wireless internet providers) under Title II (as a “common carrier”) to enable the enforcement of “net neutrality”.

By “net neutrality” I mean the principle that the delivery of internet communications should be independent of which particular entity (application or individual) is transmitting them. In other words, an internet carrier (whether broadband or traditional, wired or wireless) should not be able to block traffic to one of their customers simply because it comes from certain applications. Nor should they be able to degrade service by offering a different speed or a different error rate for different endpoints their customers might try to reach. Regulation under Title II could achieve this.

One can imagine many ways that such discrimination could be abused. The most egregious would be if a major provider like Comcast were to interfere in the democratic process by blocking (or just degrading) access to the donation pages for certain politicians — and I think we can all agree that this is unlikely. Subtler in effect but similar in kind would be for a major player like Verizon Wireless to provide greater bandwidth (thus greater speed) for one company (Amazon.com, perhaps) rather than another (Walmart.com). In this hypothetical situation Verizon might engage in this behavior because of direct kickbacks (payments) from Amazon, or as a threat to persuade Walmart to sign an unrelated contract with them. This hypothetical is far more plausible, but would be, in many ways, just as harmful. As a final example, some major internet providers such as Comcast or Verizon might intentionally manage their network so as to reduce bandwidth from certain sites their customers connect to (Netflix, perhaps) in order to demand that this third party (not the ISP’s customer) pay them additional amounts — this example is no mere hypothetical; it has HAPPENED ALREADY.

Perhaps none of this would be necessary if there were hundreds of small providers of broadband internet service giving each customer the choice of 5, 10, or more providers. In such a situation market forces might allow consumers to select from providers and choose those that did not degrade service for their favorite destinations. But that is not the world that we live in. The FCC’s own measurements (December 2013) show that two thirds of customers had access to two or fewer wireline broadband providers, and over a quarter had only a single provider. Providing network connectivity is a natural monopoly because of the cost of placing wires (or wireless stations) and the network effects of building out heavily in a given location.

Imagine if, 15 years ago, internet providers had charged a “reasonable” fee for transmitting video over broadband. This would have been eminently reasonable, given that most broadband providers at that time were (and still are) in the business of selling such a service (“cable television”). Imagine that their “reasonable” rates were just 1 tenth the consumer cost of their own offerings (1 tenth the cost of a normal customer’s cable bill) — that would have seemed quite reasonable to any regulator. But under such an environment, YouTube could never have begun. YouTube introduced a completely new business model — no one had ever offered free hosting and viewing of small customer-created clips of video. Beforehand, no one could have known whether that model would have succeeded or failed, but without net neutrality it could never even have been tried. And without YouTube, we would not have things like Khan Academy and hundreds of other projects to provide training videos on every subject.

YouTube is not the last great invention; there will be new innovators in the future who will create new industries that spur our economy and benefit all of society. And although I do not know what these innovations will be, I can say with confidence that these new industries will make use of the internet. But they will only be able to do so if you impose regulations now that enforce a policy of net neutrality; if you do not, then the next YouTube will simply not occur.

Please take this advice into consideration in your rulemaking.

Sincerely,
Michael Chermside
2936 Morris Rd
Ardmore, PA 19003

 

Book Review: Learning jQuery Deferreds

Filed Under Programming, Reviews, Technology

I don’t often write book reviews here, but in this case I have a connection to the book. My friend Terry Jones was one of the authors of a new O’Reilly programming book (you know, the ones with the animal pictures on the covers) which is titled:

Learning jQuery Deferreds
Taming Callback Hell with Deferreds and Promises

I offered to read and provide feedback on a pre-print version of the book (the publishing process all happens on PDFs these days) and I can say it was a great read. In fact, here is my review of the book:

Concurrent or parallel programming is hard – REALLY hard. Like quantum mechanics, it is one of the few areas where the mark of a true expert is that they admit to NOT clearly understanding the subject.

The “deferred” is an object pattern for handling one piece of the complexity of concurrent code. It helps to bridge the gap between writing things in a linear format as if for a single-threaded computer, and writing a series of triggers that go off when events occur. Those two models are not really compatible, and bridging them can be quite confusing.

The jQuery library offers a “deferred” object which is deceptively simple: just a handful of methods. It could all be explained completely in about a page of text (and IS explained that way if you read the docs). But no one who was not already an expert in the use of the “deferred” pattern could possibly use it correctly.

And that is where this book comes in. The text slowly explains what the class offers and how it functions, along with the reasons why each design detail is important. And then it presents a series of exercises of increasing complexity — all reasonable real-world examples, by the end of which you will fully understand how concurrency can be tamed (partly) with the deferred class. I am a reasonably skilled programmer (20 years experience, at least 15 with concurrent programming) and I found the pace to be about right: everything explained VERY clearly with examples (which is exactly what you want for a tricky subject no matter HOW well you know it).

If you’ve been using jQuery deferreds for a couple of years now you should probably skip this book — by this point you may be an expert. But for everyone else who thinks they might be using them, this is a great little tutorial and I recommend it highly.

CAPTCHAs

Filed Under Security, Technology

CAPTCHAs are those odd little boxes that show some badly malformed letters and numbers and ask you to type them in. The idea is to check whether you are a human.

The problem is that CAPTCHAs are pretty difficult for humans. And they’re fairly easy for computers. There are the simple work-arounds (like paying to break CAPTCHAs on Mechanical Turk). And there are the high-tech solutions where you simply build a computer that can solve them. My biggest concern, though, is the new kind of CAPTCHA that people have begun using. I find it to be a real problem, and it, too, can be worked around by anyone who is sufficiently motivated, but it is becoming a disturbingly common new way of identifying real humans:

Log In With Facebook

 

Version Control… for Servers

Filed Under Software Development, Technology

I wanted to pass on an excellent idea that I read on Martin Fowler’s blog. He calls it Immutable Servers, but I claim that, if you think about it properly, it is merely the application of version control to systems administration.

Everyone understands just how much version control has transformed the development of software code. It enables developers to make changes freely, rolling back changes if they need to. It enables them to look back in history and find out how things stood at any point in time, what was changed on a certain date, or when a given change was introduced. And with advanced usage, it allows “branching”, where one can experiment with a group of changes for a long time (while still working on the original branch as well) then merge them together later.

These features aren’t just for code. They are great for text documents that get edited frequently. They are a great idea for file systems. And system administrators are familiar with the idea of keeping all of their system administration scripts in a version control system. But some things are extremely difficult to put under version control. Databases are notoriously difficult to version (although Capital One 360 manages it). And servers, being pieces of physical hardware, are impossible to check into Git.

Except that they’re not. Servers are not pieces of physical hardware anymore… they were until the last decade, but in recent years that has changed. The vast majority of the servers in our data center either run on, or could run on, virtual machines. The current buzzword is “cloud computing”, but whatever you call it, we have the technology to spin up and deploy servers from a template in a matter of minutes. (The fact that it takes weeks to get a server set up for your project has nothing to do with technical problems… that’s just our own failure to take full advantage of the technology that we own.)

So, given that the servers are probably running on a virtual machine anyway, it’s a good idea to keep a virtual machine template with the correct configuration (for quickly restoring the machine). Of course, if you do this you will need to update the template every time you make a significant configuration change. Updating the image doesn’t necessarily mean you launch a virtual machine each time, make the change, then save a new image — you can use tools like Puppet or Chef as part of the image deployment process so often it is just a matter of editing a configuration file.

For the final step, Martin Fowler proposes that you take this to its logical conclusion. If every change needs to be made on the real server AND on the template, why not simplify your workflow (and make it more reliable at the same time) by making the changes directly to the image and deploying a new copy each time? You never change the production server; you just roll out a new one each time. This sounds crazy to anyone who hasn’t yet drunk the “cloud computing” Kool-Aid, to anyone for whom creating a new instance of a server takes more than a couple of minutes, but if you DO have an environment that flexible, then you might get all the benefits of version control but for servers. Netflix is one example of a major company that has taken this approach quite successfully.

 

Things my next phone should have

Filed Under Technology

After looking at the iPhone 5, I see people saying that phones already have everything they need… nothing new will happen. That’s completely absurd. There are tons of little things; for instance, I want a browser that can do everything the PC browsers can do. But there are also HUGE changes needed. Here are some things that I want for my phone: