Should Data Engineers be Domain Competent?

By Shammy Narayanan

Data and H1N1 have one thing in common: both are scary. Similar to the rush to get vaccinated, enterprises are in an unforgiving haste to assemble a data team, with a Midas-like dream of monetizing their data. The "influencer" engine is working overtime to fuel this golden dream, but the question we refuse to ask ourselves is: are we building the right data team? Does our data strategy have sanity built into it? To answer, we need not look outward; a candid conversation with our teams and a peek into the pile of production issues will spotlight our fundamentally flawed approach.

A traditional data engineer views a table with one million records as relational rows that must be crunched, transported, and loaded to a different destination. In contrast, an application programmer approaches the same table as a set of member records or pending claims that affect lives. The former is a purely technical view, while the latter is human-centric. These drastically differing lenses are the genesis of data silos.

Let’s start with a common data incident. How many times have we witnessed debilitating performance due to an ill-suited index? For any long-timer in IT, the answer is certainly a non-zero integer. It happens because indices are built on the columns DBAs perceive as vital, with scant regard for the true application access pathways. The result is gradual performance degradation that demands re-indexing, but only after a series of nagging customer complaints about slowness. Isn’t this scenario a powerful testament to how lacking basic domain knowledge leads directly to distressed customers?
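As a minimal sketch of this failure mode (the table, columns, and index names below are hypothetical, not taken from any real system), the SQLite snippet shows how an index built on the column a DBA deems vital leaves the application's real query path doing a full table scan, until an index matching the true access pathway is added:

```python
import sqlite3

# Hypothetical claims table. The DBA indexed the date column, but the
# application's true access pathway filters by member_id.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE claims (id INTEGER, member_id TEXT, claim_date TEXT, status TEXT)"
)
conn.execute("CREATE INDEX idx_date ON claims (claim_date)")  # DBA's perception

query = "EXPLAIN QUERY PLAN SELECT * FROM claims WHERE member_id = ?"
plan_before = conn.execute(query, ("M123",)).fetchall()
# The date index is useless here: SQLite falls back to a full table scan.
print(plan_before[0][3])

# An index aligned with the real access pathway changes the plan.
conn.execute("CREATE INDEX idx_member ON claims (member_id)")
plan_after = conn.execute(query, ("M123",)).fetchall()
print(plan_after[0][3])
```

The fix costs one line of DDL; the expensive part is knowing, from the domain, which column the application actually queries by.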

Graduating further, take a closer look at the partitioning strategy of your critical databases. I’d bet my paycheck that 90% of table partitions are based on a date column rather than on the actual access parameters. Such a mindless, bookish strategy drains the spool and renders the application unresponsive the moment a join is executed. This scenario is any data leader’s worst nightmare: it is not an endearing experience to perform the dual act of fixing failing performance while providing running commentary, with a disgruntled executive looking over your shoulder. Could we have built it right the first time? Not until the data analysts understand the usage patterns. Deploying and celebrating such an ill-designed application is like claiming success in surgery while the patient lies dead on the table.
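The date-partitioning trap can be sketched in a few lines. The records, column choices, and the toy `partitions_to_scan` helper below are illustrative assumptions, not the article's schema; the point is that partition pruning only helps when queries filter on the partition key:

```python
from collections import defaultdict

# Hypothetical claim records: (claim_date, plan_code, amount).
records = [
    ("2024-01-05", "GOLD", 120.0),
    ("2024-01-06", "SILVER", 80.0),
    ("2024-02-11", "GOLD", 200.0),
    ("2024-03-02", "SILVER", 60.0),
]

def partition(rows, col):
    parts = defaultdict(list)
    for row in rows:
        parts[row[col]].append(row)
    return dict(parts)

by_date = partition(records, 0)  # the bookish default
by_plan = partition(records, 1)  # aligned with the access pattern

def partitions_to_scan(parts, partition_col, filter_col, value):
    # Pruning works only when the filter column IS the partition key;
    # otherwise every partition must be opened to satisfy the query.
    if partition_col == filter_col:
        return [value] if value in parts else []
    return list(parts)

# Query: all GOLD-plan claims (filter on column 1).
print(len(partitions_to_scan(by_date, 0, 1, "GOLD")))  # every date partition opened
print(len(partitions_to_scan(by_plan, 1, 1, "GOLD")))  # pruned to a single partition
```

When queries filter by plan rather than by date, the date-partitioned layout forces a scan of every partition, which is exactly the spool-draining join the paragraph above describes.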

The same ignorance carries over to data transformation and processing. Often, load balancers are configured simply to balance the incoming data volume. This works fine as long as the data sources are homogeneous; in reality, however, we ingest data from heterogeneous sources with conflicting and varying priorities, and our approach fails to categorize and prioritize by criticality. Records destined for MIS reporting can wait; a transaction awaiting pre-authorization cannot. Instead of first-in, first-out processing, data can be routed to clusters configured around the criticality of the functional transactions. Such smartness in data ontology has to be built in, and only a team that understands the domain can do it. On any given day, a low-throughput smart pipeline is far better than a high-throughput dumb one. I could keep enumerating a myriad of such cases, from inefficient APIs and incompetent cache-invalidation strategies to miserable database locks. None of these is the product of technical incompetence; they are the direct impact of a flawed strategy that isolates data teams and treats them as a purely technical powerhouse.
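A hedged sketch of the idea above: instead of a first-in, first-out queue, a priority queue keyed on domain-assigned criticality drains pre-authorizations before MIS records. The transaction types and criticality ranks here are invented for illustration; real rankings would come from the business domain:

```python
import heapq
from collections import deque

# Hypothetical criticality ranking (lower = more urgent). Assigning these
# numbers correctly is domain knowledge, not a technical decision.
CRITICALITY = {"pre_auth": 0, "claim_update": 1, "mis_report": 2}

incoming = [
    ("mis_report", "monthly summary"),
    ("pre_auth", "surgery approval"),
    ("mis_report", "audit extract"),
    ("pre_auth", "MRI approval"),
]

# Dumb pipeline: first-in, first-out, blind to what the records mean.
fifo = deque(incoming)
fifo_order = [fifo.popleft()[0] for _ in range(len(incoming))]

# Domain-aware pipeline: a heap ordered by criticality, with arrival
# sequence as a tiebreaker so equal-priority records stay in order.
pq = []
for seq, (kind, payload) in enumerate(incoming):
    heapq.heappush(pq, (CRITICALITY[kind], seq, kind, payload))
priority_order = [heapq.heappop(pq)[2] for _ in range(len(incoming))]

print(fifo_order)      # ['mis_report', 'pre_auth', 'mis_report', 'pre_auth']
print(priority_order)  # ['pre_auth', 'pre_auth', 'mis_report', 'mis_report']
```

The FIFO pipeline makes the surgery pre-authorization wait behind a monthly report; the criticality-aware one does not, at no extra throughput cost.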

When we advocate domain knowledge, let’s not relegate it to a few business analysts tasked with translating high-level requirements into user stories. Rather, domain knowledge means that every data engineer develops an intrinsic understanding of how functionality flows and what it tries to accomplish. Of course, this is easier preached than practiced: expecting a data team to understand thousands of tables and millions of rows overnight is akin to expecting them to navigate a freeway at peak hour in reverse gear, blindfolded. It would be disastrous.

While it is amply evident that data teams need domain knowledge, it is hard to expect centralized data teams to deliver efficient results. Embedding a data team within each application team appears to be the most viable solution. This is where the concept of Data Mesh is fast evolving, and its appeal is seducing enterprises. The next wave of maturity is to move cautiously yet swiftly from a centralized mode to a federated model in which data teams are decentralized, while strategic layers such as data governance, security, and compliance stay under a common umbrella. Will this be the silver bullet for all our problems? We hope and sincerely wish so, but we cannot guarantee it. As data and analytics emerge from the dark underbelly of the IT landscape, we are set to witness more surprises and twists convoluting this complicated maze. For engineers like me, such a whirlwind is what makes working in data an exciting and exuberant challenge.

Shammy Narayanan is a Practice Head for Data and Analytics in a healthcare organization. He is 9x cloud certified and deeply passionate about extracting value from data and driving actionable insights. He can be reached at