(Editor’s Note: Enterprise Blueprints recently ran a live event on the topic of Operational Resilience, and one of the key questions that came up in the discussion was, “Can you really test for operational resilience?” Is it practical and viable to test full end-to-end important business services that combine people, process, technology and physical facilities? In this post, Andrew Frith, Principal Architect of Enterprise Blueprints, explores the reality of testing operational resilience.)
Testing of Operational Resilience
Testing of Operational Resilience is a critical activity, yet can be a highly emotive topic. One of the biggest debates is what level of testing is sufficient while balancing the risk. Any testing in production carries a level of risk. For organisations with a large estate, it can be challenging to develop the confidence to test in production against their risk appetite. The Operational Resilience regulations for financial services don’t specify what type of testing organisations need to undertake. However, in a recent speech, Duncan Mackinnon of the Bank of England commented:
“For high impact important business services within systemic firms, desktop testing is ultimately unlikely to be sufficient.”
Find out more HERE.
Too Risky to Test?
At the end of the day, the question is: if it is too risky to test, how can you be confident it will work when needed? Desktop testing is a very valuable exercise, but as many organisations have found, more realistic testing can uncover unknown issues.
A previous organisation I worked with started doing annual data centre DR tests. All systems and services were failed over to one site, with the other site shut down and disconnected from the network. Application tests were then run across all systems to validate that everything was working correctly. At least that was the plan the first time we attempted this. As you may have guessed, it didn’t go smoothly. We uncovered hard-coded dependencies on the down site, dependencies people didn’t know existed, systems that didn’t recover to a healthy state, communication issues between teams, and a litany of other problems.
At least that exercise went better than one a colleague of mine went through. Their first DR test went so well that it took them sixty hours to clean up the issues and fully recover.
Things going wrong is part of the point of these exercises. It’s better to learn and find problems in a controlled environment with a full team backing you up. That sinking feeling is horrible: standing in a data centre when a broken cable, coupled with an incorrect implementation, means that the rack of highly resilient storage in front of you is providing all the functionality of a space heater, and you have no live copies of critical data in that site and no resilience. Knowing that you’re in a controlled situation with the right support on hand helps reduce the pressure and the tendency to panic.
Plans and playbooks are great and very valuable, but unfortunately, they don’t always match reality. People, processes, and systems can fail in new and unexpected ways. Testing helps exercise all of these aspects. The goal of testing should not be to develop a scripted response for every imaginable situation but to develop a toolbox of responses and processes that can be drawn on as required to deal with whatever you’re facing. It helps people learn how to communicate and work together as a team while under stress.
A Dress Rehearsal
Testing needs to go beyond the systems and the technology and must test your people and processes as well. These activities are a dress rehearsal for the main event. When things go wrong, and they will, it’s important that everyone stays calm, knows what they’re doing, and doesn’t panic. Sometimes one of the most important things in the middle of an incident is the ability to pause, step back, and calmly evaluate where you are to make the right choices.
People are critical through all of this and must be considered as part of any plans. Think about the impacts of both pressure and time on people. Being six, twelve, twenty-four hours or more into an incident will have an impact on decision making and mental health. Testing and rehearsing people and processes goes a long way towards reducing pressure and stress in the moment.
At that previous organisation I mentioned, we learned and improved. Each test got better and ran smoother. Yes, there were always issues and new problems. Each year they had less overall impact, and we got more efficient at finding, triaging, and solving them. Testing isn’t only about finding problems; it’s about practising how to solve the unknown and unexpected.
We’ll all find ourselves up the proverbial creek at some point. Personally, I want to ensure I have a map and a working engine when I find myself in that situation. Oh, and pack a couple of paddles as well – just in case.