When It’s Okay to Fail

By now, most business leaders will identify information technology as a strategic function. No longer relegated to the background as a cost center or a necessary part of doing business, IT is taking its rightful place at the leadership table.

Take, for example, two conservative, risk-averse industries: law, and energy and power. Both depend on technology not just for efficiency but, in many cases, for mission-critical decision making.

In the legal industry, artificial intelligence and machine learning are beginning to supplement the resources of a law firm’s talent pool. Young associates’ roles are shifting, and it is not as common for them to be asked to do tedious document review. Instead, they may review computer-generated reports for errors or unusual patterns.

In the energy and power industry, drones are now used to inspect and review power lines and offshore platforms, offering companies efficiency in operations and potential cost savings.

These are two industries in which failure is not an option. In law, failure to see a loophole may cost clients millions of dollars, reputational damage, or worse. In energy, failure to catch a malfunction before it spirals out of control may result in structural, financial, or environmental damage, or, even worse, loss of life.

Given these stakes, planning for failure may seem like a pessimistic approach, and some may caution business leaders against creating a self-fulfilling prophecy. Planning for failure, however, is precisely what business leaders must do to achieve service excellence. The reality of technology is that it is not fail-safe. Some solutions are still being improved, and innovation is happening at a faster pace than many companies can match. With little time to test and retest, technology is prone to failure, and that won’t change, especially as the complexity of IT increases.

Unfortunately, most IT professionals still follow an out-of-date way of thinking: designing systems and services for best-case scenarios.

However, what if, instead of treating component failures as exceptions, we considered them normal and treated the brief periods when every component works as the exception?

REDEFINING “NORMAL”

If you were to survey the IT operations of any major company and ask, “What percentage of the time are all your IT systems up and running, functioning normally?”, the answer from most companies, even those conservative, risk-averse entities, would be less than 15 percent. In other words, more than 85 percent of the time, something is not working as expected. Perhaps it is time to redefine “normal.” Note that this question doesn’t account for the scope or impact of an outage, only whether a component, system, or process isn’t working correctly. This distinction matters because it is a clue to the actual service excellence opportunity: a broken component doesn’t necessarily mean the service is unavailable.
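The arithmetic behind this distinction can be sketched in a few lines of Python. The figures are illustrative assumptions, not survey data: 200 independent components, each working 99 percent of the time. Under those assumptions, everything works at once only about 13 percent of the time, yet a service backed by two independent replicas of such a component is still available roughly 99.99 percent of the time. The function names are hypothetical.

```python
def all_up_probability(component_availability: float, component_count: int) -> float:
    """Probability that every one of `component_count` independent components is up."""
    return component_availability ** component_count


def redundant_availability(replica_availability: float, replica_count: int) -> float:
    """Probability that at least one of `replica_count` independent replicas is up."""
    return 1.0 - (1.0 - replica_availability) ** replica_count


if __name__ == "__main__":
    # Illustrative figures only: 200 components, each up 99% of the time.
    print(f"everything up at once: {all_up_probability(0.99, 200):.1%}")
    # The same 99% component, deployed as two independent replicas.
    print(f"service with 2 replicas: {redundant_availability(0.99, 2):.2%}")
```

The independence assumption is a simplification; real components often fail together, which is one reason shared dependencies deserve attention.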

HOW TO CREATE SERVICES DESIGNED TO FAIL (AND STILL BE OKAY)

What if companies were to implement services that were designed to fail, assuming components would break and that processing anomalies would happen? Could these services be architected in such a way that failure could occur, and repairs could be made, without impacting performance and availability to users? To both questions, the answer is “yes.” It is possible to create services designed to fail, and some of the leading companies in the world are doing it today. The key lessons these companies have learned are to mitigate critical dependencies, to replace large releases with small, continuous ones, and to reduce the number of places where a single component can impact the whole system. Redundancy is essential.
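As a minimal sketch of what redundancy looks like in code, the Python example below routes a request across several interchangeable replicas and treats a broken replica as a routine event rather than a fatal one. All names here (`call_with_failover`, `broken_replica`, `healthy_replica`) are hypothetical and exist only for illustration; a production system would add timeouts, health checks, and load balancing.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each replica in order and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except Exception as exc:  # a broken replica is expected, so move on
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replica(s) failed for {request!r}")


def broken_replica(request: str) -> str:
    """Hypothetical replica that is currently down."""
    raise ConnectionError("replica is down")


def healthy_replica(request: str) -> str:
    """Hypothetical replica that is working normally."""
    return f"ok: {request}"


if __name__ == "__main__":
    # The first replica fails, but the caller still gets an answer.
    print(call_with_failover([broken_replica, healthy_replica], "ping"))
```

The user only sees an error when every replica fails, which is the point of removing single points of failure.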

The architectures necessary to deliver services designed to fail are also designed to apply “hot fixes,” eliminating the need to take the service offline to make changes. If you can solve problems and make changes without taking the system offline, then 100 percent availability is possible.

SAFE TO FAIL IN THE REAL WORLD

To bring this concept to life, consider a real-world case: Netflix. In 2011, the company began experimenting with the Simian Army, a “safe to fail” suite of tools that it ran against its Amazon Web Services infrastructure. Netflix later released parts of the suite as open source, and it became a reference point for DevOps teams. Building the Simian Army was critical to Netflix’s ability to operate in a cloud-based environment fraught with potential interruptions while continuing to deliver reliable products and services. Its components include Chaos Monkey, which randomly disables production instances to confirm that the service survives such failures; Latency Monkey, which introduces artificial delays and tests how the system responds and recovers; and Conformity Monkey, which finds non-conforming instances and shuts them down even if they appear to perform well initially. Finally, Chaos Gorilla, at the top of the Simian Army hierarchy, simulates an outage of an entire Amazon availability zone to see how the Netflix system will handle it. Using these tools, Netflix, and eventually other companies, can still deliver on customer expectations in these unstable environments without sacrificing overall reliability and performance for the long haul.
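The idea behind this kind of testing can be illustrated with a toy simulation in Python. This is not Netflix's code or API; the `Service` class and both functions are hypothetical, and the "instances" are just entries in a dictionary. The sketch deliberately takes down one randomly chosen instance and then checks whether the service can still answer a request.

```python
import random


class Service:
    """A toy service backed by several interchangeable instances."""

    def __init__(self, instance_count: int) -> None:
        self.healthy = {f"instance-{i}": True for i in range(instance_count)}

    def handle(self, request: str) -> str:
        """Serve the request from the first instance that is still up."""
        for name, is_up in self.healthy.items():
            if is_up:
                return f"{name} served {request}"
        raise RuntimeError("no healthy instance available")


def terminate_random_instance(service: Service, rng: random.Random) -> str:
    """Chaos-style step: mark one randomly chosen healthy instance as down."""
    candidates = [name for name, is_up in service.healthy.items() if is_up]
    victim = rng.choice(candidates)
    service.healthy[victim] = False
    return victim


def survives_single_failure(instance_count: int, seed: int = 0) -> bool:
    """Return True if the service still answers after one induced failure."""
    service = Service(instance_count)
    terminate_random_instance(service, random.Random(seed))
    try:
        service.handle("ping")
        return True
    except RuntimeError:
        return False


if __name__ == "__main__":
    print("3 instances survive one failure:", survives_single_failure(3))
    print("1 instance survives one failure:", survives_single_failure(1))
```

Running the experiment routinely, rather than waiting for a real outage, is what turns failure from a surprise into a rehearsed event.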

DRIVING FOR SERVICE EXCELLENCE

Service excellence isn’t just fulfilling service-level agreements and minimum expectations. Service excellence means giving the organization, its internal and external clients, business partners, customers, and others the tools they need to move forward, and not just the minimum levels of quality and performance to coast. Modern technology architectures can enable an organization to strive for more than “good enough.” It all starts with changing the way we think about failure – treating it as the new normal and planning for it.