What To Do About Fragile Systems – Part 1

Address Technical Debt before it leads to Bankruptcy

By James Wilt, Distinguished Architect

When we look at legacy systems built on out-of-support platforms in deprecated languages using dated data repositories, we can easily relate to their fragility. We’ll discuss this in Part 1 below.

However, there are also new products built on less solid foundational practices that introduce a whole new level of fragility into the solutions ecosystem. These systems are plagued by best intentions and poor execution. We’ll discuss these in Part 2 in a follow-on article.

Finally, there exist fragile systems serving purposes that are well accepted under certain circumstances. Let’s learn when it’s OK to have fragility in Part 3 of that follow-on article.

Ignoring difficult situations and decisions is almost a form of art. We hope the bad we aren’t addressing will remain dormant so we never have to tackle its ramifications head-on. To be clear:

Hope is not a strategy!

The more we ignore, the deeper the hole of despair when bad happens. Let’s examine more practical approaches to deal with fragile systems to, perhaps, avoid bad altogether.

Part 1 – Legacy Systems

I was recently on a support call to renew my membership with an organization and there was a glitch where they couldn’t take my credit card and needed to reboot their legacy system as part of their standard operating procedure. OK. That happens. They took my phone number and said they would call me right back after the reboot completed. That was six weeks ago and still no call back. This was actually my third attempt to renew my membership as the other two also ended abruptly due to software “glitches”. Lost revenue and lost customer. Worse, this will never be noticed because these old, fragile systems have no telemetry that tracks their losses! It’s all absorbed into acceptance that these things happen.

Every organization has some sort of legacy system like this. It’s worse when that system is a key dependency for the operation of other, modern systems. The original authors have moved on, retired, or were consultants now long gone. These legacy systems generally require heavy manual intervention, secretly cost organizations dearly in lost revenue, lost customers, and frustrated employees, and pose moderate to high security risks.

A thorough analysis of the actual cost and loss from legacy systems is rarely undertaken. While maintenance and operational costs are more easily attainable, actual impact and hidden costs are much more difficult to obtain.

Discover hidden costs & fallout with suggested takeaways:

·      Cost & impact of human manual interaction to correct errors:

  • You actually have to pay people to manually fix things when legacy failure issues arise. While fixing these issues, other work is not being done (it’s a double-hit).
  • Employees fixing issues manually often become more frustrated and dissatisfied which leads to decreased morale & productivity.
  • Customer experience is negatively impacted when manual interaction is necessary. There is often some form of compensation in the form of discount or free service for their troubles which creates additional cost for the mistake being corrected.
  • Takeaway: Calculate all compounded costs, compensations, and delays associated with every manual interaction type including impact to your customer NPS.

·      Risk of unsupported platforms (hardware, OS, software, and such) subject to failure with long or no recovery options:

  • Because many legacy systems cannot be redundant, organizations often must pay for cold stand-by configurations to rehydrate when bad happens. What they fail to do, however, is regularly test recovery.
  • Backups never tested for restoration are essentially no backup at all.
  • Takeaway: Tally the cost of keeping cold stand-bys at the ready and run regular restoration exercises of these systems and databases to determine cost, time, and impact should rehydration become necessary. How many of your systems are actually unrecoverable?

·      Loss of revenue due to your inability to properly service customers where & how they expect to be serviced:

  • You are losing customers more often due to the many disparate and disconnected systems you force your customer facing resources to use. When you receive a poor review, you blame it on your support personnel, not your software deficiencies, and spend money training your resources to do better instead of fixing your systems which serves to further diminish morale.
  • If your support staff churn is high, consider looking into the systems they are using to interact with customers. Do these systems serve their needs or do they undermine their effectiveness?
  • If your legacy systems are difficult for your own resources, how much more are they for your upstream customers? Self-service might actually drive self-destruct.
  • Takeaway: Revise customer and employee satisfaction surveys to ascertain system vs. human deficiencies to understand where to focus. Monitor self-service and equate early departures to lost customers. Count & calculate this impact cost.

·      Fines & penalties where legacy systems are no longer compliant with the latest regulations:

  • There exist a number of organizations that consciously pay rather high fines as they are perceived to be cheaper than replacing the legacy software responsible for them!
  • Systems out of compliance are rarely in isolation. They often negatively affect both internal and external partner integrations.
  • Takeaway: Tally existing and potential compliance penalty costs for each legacy system and quantify all integrations affected should a legacy system fall out of compliance.

·      Fear of breaking production when touching & releasing updates to fragile legacy systems:

  • There exist production systems where access to or understanding source code is greatly compromised. Minor configuration changes are scary and anything involving code is outright frightening. Many carry up and down-stream dependencies that require hundreds of teams to work together to release.
  • Legacy deployments are often void of any rollback process. If a production rollout breaks, it may remain broken for days to weeks, forcing manual process playbooks to be activated.
  • Many legacy systems suffer from hard-coded IP addresses deeply embedded inside their integrations. Simply moving servers to a different or dynamic network address space will break production instantly.
  • Takeaway: Build a dependency tree for each legacy system and calculate the compounded costs for weeks of down time for each ecosystem affected.

·      Lost innovation opportunities and competitive advantage because of legacy dependency anchors preventing modernization:

  • Every organization must leap onto some emerging technology-based fad (from RPA bots to AI). When legacy integrations are present, you either bolt on the new tech or run the new tech as shadow-IT.
  • Some innovations like those around Zero Trust require new ways of writing code that simply are not possible with legacy systems. Catering to their least common denominator, organizations either entirely miss out or must greatly compromise advancements in what is new and emerging.
  • My car’s oil change still requires paper forms. My carwash scans my vehicle and keeps meticulous electronic records of all my visits and tells me when I get my “free” wash! Over time, noticeable advancements in customer experiences through technology will negatively impact those who are too slow to change because of legacy anchors.
  • Takeaway: Transparently create a list of new technologies currently having industry impact that you are avoiding. Calculate the business & financial impact & benefits you’re missing by missing out.

·      Brand reputation damage/lawsuits as many legacy systems exist under security exceptions:

  • Security has one major vulnerability: legacy systems. Legacy systems literally can break any security policy and still get an “exception”, which is some mysterious get-out-of-accountability card for everyone in the org, under the premise there is no other option.
  • Reality check: exceptions never lessen the vulnerability threats they are created for. In fact, organizations are generally at greater risk for each exception because the legacy systems are most likely more
  • Where there are exceptions, there also are necessary overly restrictive security perimeters that inhibit other more modern & secure systems from interacting in more optimal ways.
  • Takeaway: Review all security exceptions and calculate the depth of impact and related costs, including necessary modern system workarounds, those which might arise from lawsuits, and impact from damage to your corporate reputation.

Whew!
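Of the takeaways above, the dependency tree is the one that lends itself most readily to a few lines of code. What follows is only a minimal Python sketch: the system names, the dependency map, and the weekly downtime figures are all made up for illustration, and a real inventory would have to come from your own architecture records.

```python
from collections import deque

# Hypothetical dependency map: each key is a system, each value lists the
# systems that depend on it directly (its downstream consumers).
DEPENDENTS = {
    "legacy-billing": ["order-portal", "finance-reports"],
    "order-portal": ["mobile-app"],
    "finance-reports": [],
    "mobile-app": [],
}

# Hypothetical cost of one week of downtime per system, in dollars.
WEEKLY_DOWNTIME_COST = {
    "legacy-billing": 50_000,
    "order-portal": 120_000,
    "finance-reports": 20_000,
    "mobile-app": 200_000,
}


def affected_systems(root, dependents):
    """Return root plus every system downstream of it, in breadth-first order."""
    seen = [root]
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:  # also guards against circular dependencies
                seen.append(child)
                queue.append(child)
    return seen


def compounded_downtime_cost(root, weeks, dependents, weekly_cost):
    """Sum the downtime cost of root and everything that depends on it."""
    impacted = affected_systems(root, dependents)
    return weeks * sum(weekly_cost[name] for name in impacted)


impacted = affected_systems("legacy-billing", DEPENDENTS)
cost = compounded_downtime_cost("legacy-billing", 2, DEPENDENTS, WEEKLY_DOWNTIME_COST)
print("Affected ecosystem:", impacted)
print(f"Two weeks of downtime: ${cost:,}")
```

The point of walking the whole tree is that the legacy system's own downtime cost is usually the smallest number in the sum; the consumers hanging off it are where the compounding happens.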

Now, calculate & total these costs & potential losses. According to the Consortium for Information & Software Quality (CISQ) studies in 2018, 2020, and 2022, the cost of poor software quality in the US has grown to $2.41 Trillion, with Legacy Systems accounting for $520 Billion.
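To make that tally concrete, here is one way it might be kept, sketched in Python. The seven cost fields simply mirror the seven takeaways above; the class name, the system names, and every dollar figure are hypothetical placeholders, not data from any real organization.

```python
from dataclasses import dataclass


@dataclass
class LegacySystemCosts:
    """Estimated annual hidden costs for one legacy system (figures hypothetical)."""
    name: str
    manual_intervention: float = 0.0   # labor, customer compensation, delays
    standby_and_recovery: float = 0.0  # cold stand-bys, restoration exercises
    lost_revenue: float = 0.0          # abandoned self-service, lost customers
    compliance_penalties: float = 0.0  # fines paid or anticipated
    downtime_exposure: float = 0.0     # failed releases across dependent systems
    missed_opportunity: float = 0.0    # technologies you are avoiding
    security_exposure: float = 0.0     # exception workarounds, lawsuits, reputation

    def total(self) -> float:
        # Add up every field except the system's name.
        return sum(value for key, value in vars(self).items() if key != "name")


def portfolio_total(systems):
    """Total hidden cost across all legacy systems."""
    return sum(system.total() for system in systems)


def rank_by_cost(systems):
    """Most expensive legacy system first."""
    return sorted(systems, key=lambda system: system.total(), reverse=True)


systems = [
    LegacySystemCosts("membership-billing", manual_intervention=120_000,
                      lost_revenue=300_000, security_exposure=80_000),
    LegacySystemCosts("claims-mainframe", standby_and_recovery=200_000,
                      compliance_penalties=150_000),
]

for system in rank_by_cost(systems):
    print(f"{system.name}: ${system.total():,.0f}")
print(f"Portfolio total: ${portfolio_total(systems):,.0f}")
```

Even rough estimates in each field are useful here: the ranking tells you which system to tackle first, and the portfolio total is the number to put in front of leadership.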

Change is difficult because we’re presented with too many “fix frameworks”, tried a few, and learned our culture, talent, maturity, sponsorship, and commitment fall short of what’s required to execute to completion. In a traditional manner, it truly is a race where only Unicorns win.

Let’s Go Non-Traditional:

·      Are you communicating the costs and risks associated with your fragile legacy systems to all proper audiences?

  • Name your most at-risk legacy systems in the “Risk Factors” section of your SEC Form 10-K/Form 40-F or equivalent annual report, identifying the financial & business impact should they fail.
  • This heightens awareness and provides fair warning should/when they fail and negatively impact your revenue stream.
  • One insurance company I admire already does this and has funds in reserve to accommodate future failure.

·      Obsolete legacy systems by modernizing connected systems instead:

  • The strangler fig and the abstraction layer patterns both attempt to modernize legacy systems by gradually replacing parts of the legacy system with new, more modern components. However, upstream dependencies and downstream tight couplings often derail any progress.
  • Focus instead on replacing/modernizing your connected systems first. This will obsolete your legacy systems so they can be sunset.

·      Apply Gall’s Law over big bang modernization efforts:

  • Funded projects suffer natural scope creep as bolt-on features get added as the new “necessary” without additional funds, resources, or time.
  • Avoid this by attaining alignment with executive leadership to focus on one single Business Outcome for a legacy modernization initiative. Deliver only to that outcome. When bolt-ons are presented, weigh them against that one single business outcome and reject if it’s not critical path.

·      Force vendors to share the risk:

  • The flowery case studies vendors present are with environments and organizational maturities that are an impedance mismatch to yours. Vendors are often just as surprised when their product falls short!
  • Tie your purchase contracts to production delivery, operation, and performance contingencies.
  • Vendor platforms often seek to take control over your data as that’s easier than integrating with your sources of truth.
  • Architect your use of their platform to either sync only necessary metadata (in milliseconds) and link back to your source of truth, or provide push updates to your source systems.

·      Pull the plug (a.k.a. rip off the Band-Aid):

  • I am reminded regularly of systems that continue to run and produce output for which there is no audience. Step up audits to rigorously seek these out, then perform a [warm] shut-down/pause of the systems identified (if an audience does exist, they will surface and you can reactivate).
  • High risk systems need to reduce their vulnerability footprint before harm ensues. For gradually increasing periods, [warm] shut-down/pause your most vulnerable legacy apps/systems. Give affected consumer apps/systems two options — (1) find a new way of operating without the targeted legacy system or (2) transfer budget & resources to help provide for its replacement.

·      Better understand legacy systems with Generative AI tools:

  • Many legacy apps/systems have nobody that fully understands them — especially when the platform, language, and data environment are no longer supported.
  • Generative AI companion tools (a.k.a. co-pilots) are your new best friend. Not only can they explain legacy code, they can often find flaws in it.
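The “gradually increasing periods” in the pull-the-plug item above can be planned ahead rather than improvised. Below is a small Python sketch of such a schedule. The function name, the doubling growth rate, the weekly cadence, and the start date are all assumptions chosen for illustration; pick values that suit your own change-management calendar.

```python
from datetime import datetime, timedelta


def pause_schedule(first_pause, initial_hours=1, growth=2, rounds=4, gap_days=7):
    """Return (start, end) pause windows whose length grows each round.

    All defaults are illustrative: a one-hour first pause, doubling every
    round, with one pause per week.
    """
    windows = []
    hours = initial_hours
    start = first_pause
    for _ in range(rounds):
        windows.append((start, start + timedelta(hours=hours)))
        hours *= growth                      # the next pause lasts longer
        start += timedelta(days=gap_days)    # the next pause is a week later
    return windows


# Hypothetical start: a Monday at 02:00.
schedule = pause_schedule(datetime(2024, 1, 8, 2, 0))
for start, end in schedule:
    print(f"Pause from {start:%Y-%m-%d %H:%M} to {end:%Y-%m-%d %H:%M}")
```

Publishing the schedule in advance gives consuming teams fair warning and makes each pause a deliberate test of who still depends on the system.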

Fragile Legacy Systems are the Achilles’ Heel in the pursuit of the modern enterprise. The longer you wait to aggressively address them, the more they will cost you in the long run and in ways you can neither imagine nor measure until it’s too late.

Part 2 will appear Friday.