Scaling and “Gray Failures”

There are known and proven ways to build fault-tolerant systems (Erlang!?). An underlying issue with most of them is that they tend to assume that components are either up or down.
Gray failure is a real thing, however: the component is “failing” (quotes intentional), but the failure doesn’t manifest in an easily categorizable way. Think “flaky I/O”, “bufferbloat”, “packet loss”, and so on; things that may or may not count as a failure depending on the impact on the user, the interactions with other components, etc.

The issue with gray failures is that they become more common as the system scales out. In fact, “the complex interactions, interference, and dependencies among cloud components … in a multi-tenancy environment” (•), and the emergent behavior thereof, are the perfect petri dish for gray failures.

A common characteristic of all gray failures is Differential Observability, viz., the behavior of the component varies depending upon the observer. Think “the component isn’t functioning, but the monitoring system doesn’t see the problem”. An example should help: if your network connection is being throttled (yay, US Telcos!) resulting in low bandwidth, your connectivity tests show everything as A-OK, but your apps don’t load.
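To make the throttling example concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Link` model, the function names, the 64-byte "ping", the 2 MB "page", and the timeouts are all made up, and real networks are far messier. The point is only that two observers of the same component can reach opposite conclusions.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Hypothetical network link: fixed latency plus a bandwidth cap."""
    latency_s: float
    bandwidth_bytes_per_s: float

    def transfer_time(self, n_bytes: int) -> float:
        # One latency hit plus the time to push n_bytes through the cap.
        return self.latency_s + n_bytes / self.bandwidth_bytes_per_s


def ping_check(link: Link, timeout_s: float = 1.0) -> bool:
    # Connectivity-style check: a tiny 64-byte payload (assumed size).
    return link.transfer_time(64) <= timeout_s


def app_check(link: Link, page_bytes: int = 2_000_000, deadline_s: float = 3.0) -> bool:
    # Usage-style check: a payload roughly the size of a page load (assumed).
    return link.transfer_time(page_bytes) <= deadline_s


def is_gray(link: Link) -> bool:
    # Differential observability: the monitor says "up", the user says "down".
    return ping_check(link) and not app_check(link)


healthy = Link(latency_s=0.05, bandwidth_bytes_per_s=5_000_000)
throttled = Link(latency_s=0.05, bandwidth_bytes_per_s=50_000)

print("throttled:", ping_check(throttled), app_check(throttled), is_gray(throttled))
print("healthy:  ", ping_check(healthy), app_check(healthy), is_gray(healthy))
```

With these made-up numbers, the throttled link passes the ping check but would need about 40 seconds to deliver the page, so only the usage-style check notices the problem.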

These failures need to be dealt with head on. Remember, as you scale, the problem will get worse! At the very least, you should be doing the following:

  • Make sure that your monitoring mimics the actual usage of the component. This may not be feasible in many cases, e.g., due to performance/cost constraints, but the closer you get, the better off you are.
  • Use multiple measures, both lateral and orthogonal, to inform your observability. The gestalt should help here.
  • Map temporal patterns in your measures, so that you can validate the current behavior against the past.
  • Deep Learning should help in the above cases when the sheer number of measures and the volume of data get beyond what can feasibly be hand-coded.
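
The "multiple measures" and "temporal patterns" points above can be sketched together in a few lines of Python. This is a rough illustration under stated assumptions, not a recommended detector: the function names, the hour-of-day bucketing, the 3-sigma threshold, and the sample numbers are all hypothetical, and a plain standard-deviation test is simply the easiest stand-in for "validate the current behavior against the past".

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_baseline(history):
    """history: iterable of (hour_of_day, value). Returns {hour: (mean, stdev)}."""
    buckets = defaultdict(list)
    for hour, value in history:
        buckets[hour].append(value)
    return {hour: (mean(vals), pstdev(vals)) for hour, vals in buckets.items()}


def deviation(baseline, hour, value, min_stdev=1e-9):
    """How many standard deviations `value` sits from the norm for this hour."""
    mu, sigma = baseline[hour]
    # Guard against a zero stdev when the history for this hour is constant.
    return abs(value - mu) / max(sigma, min_stdev)


def suspicious_measures(baselines, hour, current, threshold=3.0):
    """Names of the measures whose current value is unusual for this hour."""
    return sorted(
        name
        for name, value in current.items()
        if deviation(baselines[name], hour, value) > threshold
    )


# Made-up history for 14:00 on previous days, one series per measure.
baselines = {
    "latency_ms": build_baseline([(14, v) for v in (100, 110, 90, 105, 95)]),
    "error_rate": build_baseline([(14, v) for v in (0.010, 0.012, 0.009, 0.011, 0.008)]),
}

# Today at 14:00: the error rate looks ordinary, the latency does not.
current = {"latency_ms": 180, "error_rate": 0.011}
print(suspicious_measures(baselines, 14, current))
```

Here an alert keyed only on error rate would stay quiet, while the latency measure is flagged as far outside its usual range for that hour. That is the sort of disagreement between measures worth looking into.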

There is an excellent writeup on this in a recent paper by Huang et al. (•). Go take a look at it for more!

(•) Gray Failure: The Achilles’ Heel of Cloud-Scale Systems, by Huang et al. at

Source: Deep Learning on Medium