What To Do About Fragile Systems – Parts 2 & 3

Address Technical Debt before it leads to Bankruptcy

(Part one appeared here on Thursday)

By James Wilt, Distinguished Architect

When we look at legacy systems built on out-of-support platforms in deprecated languages using dated data repositories, we can easily relate to their fragility, which was discussed in Part 1.

However, there are also new products built on less solid foundational practices that introduce a whole new level of fragility into the solutions ecosystem. These systems are plagued by best intentions and poor execution. We’ll discuss these in Part 2 below.

Finally, there exist fragile systems serving purposes that are well accepted under certain circumstances. Let’s learn when it’s OK to have fragility in Part 3 below.

Ignoring difficult situations and decisions is almost a form of art. We hope the bad we aren’t addressing will remain dormant so we never have to tackle its ramifications head-on. To be clear:

Hope is not a strategy!

The more we ignore, the deeper the hole of despair when bad happens. Let’s examine more practical approaches to deal with fragile systems to, perhaps, avoid bad altogether.

Part 2 – Modern System Fragility

Many organizations are being forced into quickly providing software-based digital customer experiences, from IoT-immersed products such as the jewelry we wear to flashy audio/visual entertainment options to the cars we drive. Not all these organizations are prepared to produce robust software at the scale they must now deliver.

My wife’s new car has an infotainment system that adjusts everything from dashboard style and music to seat position based on cues ranging from the key fob to the phone’s Bluetooth signature. It gets (guesses?) who’s driving right about 60% of the time. When it fails, you must manually log in with your user id. Yes, that’s right, user id. It then takes 90 seconds, on average, to load your settings. If you’re impatient (because your primo Costco parking space is being hounded by 5 other patrons) and drive off before it’s done loading, an error message appears and everything turns off. I mean everything: no navigation, no cell phone integration, no radio. Black screen.

Our home audio/visual entertainment system is no different. Just sneeze in the family room and I’m on the phone to my system’s licensed integrator scheduling a service call (at my expense) to reconfigure their software to regain volume control. This is almost a monthly event.

I consulted my favorite Generative AI duo (Bing Chat w/ChatGPT-4) for a less anecdotal example and it answered in the most unexpected way — it actually failed!

[Image: GenAI failure]

At first I thought it was being funny with me by maybe simulating a failure, but rebooting and further testing (with recorded videos) found it to be completely unresponsive and broken. While Bing Chat acknowledged no outage, ChatGPT did, as shown.

Fragile. Modern. Systems.

No matter what your business, it will rely to some extent on software systems for its livelihood. Ford CEO Jim Farley said it best, “We’re not a car company anymore. We’re a software company that happens to make cars.”

Now, replace his “car” with your business’ products and read that aloud.

Are you? Are you a software company that happens to do what you do? If not, then heed Wednesday Addams’ (from The Addams Family) warning, “Be afraid. Be very afraid.” Why? Because there are real software companies coming that will learn to do what you do, faster than you can learn to be a software company! Garrett Camp & Travis Kalanick of Uber and Logan Green & John Zimmer of Lyft had no prior experience in transportation, and look at their impact.

In this June 8, 2023 Fully Charged podcast, Farley points toward several factors that promote fragility:

  • Legacy Business Practices — We farmed out all the modules that control the vehicles to our suppliers, because we can bid them against each other [for price]…We have about 150 modules with semiconductors all through the car. The problem is that the software is written by 150 different companies. And they don’t talk to each other.
  • Masters of Legacy Domains — I kept watching our [Internal Combustion Engine] ICE engineers try to figure out how to do over-the-air updates, or change the software for the vehicle [but] they’re not software people.
  • Legacy Architectures — It’s shocking to me how many [automakers] are sticking with very old electrical architectures and software from a confederacy [of vendors]. That will never work. No matter how many software engineers they hire, the code’s not going to work.
  • Talent Struggles — It’s difficult for legacy car companies to get software right. (Yes, he used the term “legacy car companies.”)

Let’s sum this up as a propensity to persist Legacy (a leadership call) and a lack of necessary Experience (a talent call).

Legacy thinking forced onto new platforms generally results in what Farley expresses. New platforms are generally meant to be leveraged with new ways of working for good reason. Yet, we insist on forcing legacy practices on them and somehow expect better outcomes.

One justification for this poor behavior arises when new platforms are built on, and ride on the shoulders of, legacy designs that are now “hidden”. Those reluctant to change will cling to those legacy underpinnings to justify blocking the actual intent of the modernization. Yes, serverless functions ultimately execute on a server somewhere, but the paradigm shift/intent is to engineer and leverage them as if this new realm is made of unbounded compute that scales infinitely. It is a different way of thinking, designing, and architecting, one that many legacy leaders oppose and prevent.

Talent ignorance (you don’t know what you don’t know) may be the most influential and tactical cause of modern fragility. It’s so easy to spin up an IDE, or low-code environment, and pull together unrelated & disparate open-source snippets to quickly mash together some application that can be placed into production in short order as a minimum viable product (MVP), void of any Non-Functional Requirements.

How is Farley addressing this? “That’s why, at Ford, we decided for our second-generation [EVs] to completely insource electrical architecture. To do that you need to write all the software yourself. But car companies haven’t written software like this, ever.” They are becoming a software company, taking accountability and responsibility for all components and integrations. To do this properly, Farley had to “split the company into three pieces” separating legacy teams & practices from modern software engineering teams & practices (Bimodal, anyone?). This also meant attracting new talent.

Let’s Promote the Ability of a System to Thrive in Adversity and Adapt to Change:

·      Set the stage for New Platforms to succeed:

  • As Farley shared, legacy players may not be the best choice for new platforms and may well introduce fragility.
  • Consider choosing leaders & teams who will reap the greatest benefit & empowerment to champion a new platform.
  • One financial organization I admire built their Cloud platform in this way, with their software engineering discipline bimodally isolated from their legacy infrastructure discipline. It worked.
  • Focus on learning the new ways of working that a new platform promotes over forcing your legacy ways of working upon the new platform.

·      Optimize highly connected/integrated/distributed system components

  • Hub systems that connect to everything offer countless more vectors for failure that diminish your overall reliability.
  • Your availability is not the sum of your dependencies’ availabilities; it’s the product. If you have three system dependencies, each with an up-time of 95%, your up-time will never be better than 85.7%! If those three dependency systems increased their up-time to 99% (4 percentage points better each), your up-time could reach 97% (11.3 percentage points better!).
  • Work to increase/tighten the robustness of each individual connection, even just a little, as it will greatly improve overall resilience.
  • Eliminate unnecessary connections & dependencies where they offer little business value or the risk of failure is far greater than the benefit.
  • Increase active telemetry to alert on failures sooner and leverage the Circuit Breaker pattern to provide more constructive responses to dependency failures.
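
Two of the points above lend themselves to a short sketch: the availability arithmetic and the Circuit Breaker pattern. The Python below is only a minimal illustration under stated assumptions, not a production implementation. The names serial_availability, CircuitBreaker, and CircuitOpenError, along with the max_failures and reset_after parameters, are hypothetical and chosen purely for this example.

```python
import time
from math import prod


def serial_availability(dependency_uptimes):
    """Best-case availability when every dependency must be up at once."""
    return prod(dependency_uptimes)


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Minimal circuit breaker sketch (hypothetical helper, not a library API)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds before a retry
        self.clock = clock                # injectable clock, handy for testing
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of waiting on a dependency known to be down.
                raise CircuitOpenError("dependency unavailable, try again later")
            # Cool-down has elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


if __name__ == "__main__":
    print(f"{serial_availability([0.95] * 3):.1%}")  # 85.7%
    print(f"{serial_availability([0.99] * 3):.1%}")  # 97.0%
```

Wrapping each outbound dependency call in its own breaker lets a caller catch CircuitOpenError and return a cached or degraded response right away, rather than letting every request wait on a dependency that is already known to be failing.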

·      Understand patchwork designs & implementations

  • There are many benefits to leveraging and contributing to open-source code, when used and managed responsibly. However, in the interest of quickly getting to production/market, some teams will cobble together/copy snippets from mash-ups, open-source, and friends & family without proper rigor and understanding, which introduces some of the hardest-to-debug fragilities.
  • When individual instances work but fail to scale, it’s often attributed to a lack of maturity, experience, and rigor in design & process. These need to be readily addressed through better peer reviews, testing, and best practices that promote a culture of excellence.
  • Generative AI companion/co-pilot tools can be leveraged to examine and rate “borrowed” code before it is introduced into your solutions. Thoroughly understand, evaluate, and scrutinize all harvested code as if it were your own.

·      Missed Customer Expectations

  • Fragility presents itself in forms that range from poor performance to unfortunate behaviors. Never assume your system’s fragility is in any way acceptable to your customers.
  • Somebody had to tell the emperor he had no clothes. If everyone is telling you something is just fine, seek those who see it differently and understand why.
  • One of the best ways to truly get a window into what’s fragile is to mine public support forums for your products and examine unanswered questions. It will identify confusion in using your product and help you discover missing features & capabilities.

Fragile Modern Systems are an embarrassment and threat to the modern enterprise. They amplify deficiencies in leadership, talent, and process that diminish confidence and lose customers.

Part 3 – When it’s OK to Build Fragile Systems

There is actually a case for Fragile Systems. In times of crisis, there is a place for temporary and disposable systems that may lack rigor, security, NFRs, etc. These systems can be spun up in days and are frequently critical for the survival of human lives, environment, and/or business continuity.

  • These are One use and Done applications that serve a single purpose for a very limited time.
  • They are active and accessible only for small windows of use and are then shut off.
  • This code dies. It’s never shared and not used as a seed/starter for subsequent work.
  • The risk imposed must be overshadowed by the good the system serves for the short period it is being used.

In such situations, proactive crisis management teams can prepare systems with greater rigor and resilience by following guidance such as A Crisis Situations Decision-Making Systems Software Development Process With Rescue Experiences.

Temporary Fragile Systems can provide a most valuable service to the modern enterprise when they are leveraged to resolve one specific critical emergency or temporal task and are then forever abandoned.