Five Reasons Why In-Database Machine Learning Makes Sense

By Paige Roberts

The adoption of machine learning-based artificial intelligence (AI) in businesses is ramping up quickly. But building AI is not easy. Machine learning (ML) models have to be trained and, when their accuracy drops, retrained on massive datasets (high velocity, volume, and variety) that must be transformed, filtered, and prepared for each ML algorithm. That process is time-consuming, repetitive, and prone to error.

Most companies take up to a year to move their models into production, and an estimated 30% of businesses need another three months beyond that. A number of challenges are at play.

First, moving large volumes of data over distances results in high latency.

Second, because of the bandwidth constraints and computational limitations of many commonly used ML technologies, large data sets cannot be processed in full, so only a subset of the data is used to train models. This reduces the accuracy of insights and increases the risk in business decisions that are made using those insights.

Third, the differences between development and production environments can require data engineers to rebuild from scratch everything that data scientists developed. This results in significant project delays.

Fourth, developing and operationalizing models in two different technical stacks requires extensive ongoing maintenance and multiple hard-to-find skill sets. This can make AI projects very expensive.

Leveraging In-Database ML At Scale

Real-world data sources (customer databases, click-stream behavior, sensor logs, archived data lake formats, etc.) are far too complex for traditional business intelligence tools. To extract the full value of accessible data, businesses need to change the methods and technologies used for ML model creation, training, deployment, monitoring, and management.

This is where in-database ML comes in and has the potential to become a real game changer.

As the name suggests, in-database ML takes advantage of the built-in ML capabilities of an analytical database. Using in-database ML, organizations can simplify both their infrastructure and its maintenance, and there are several other reasons why this approach makes a lot of sense:

1. Reduced Data Movement

With in-database ML, data scientists can leverage the locality of the data and avoid lengthy data pipelines. When ML models use data directly from the database, nothing is extracted, so there is no transfer latency, no data type incompatibility, and no wait for data to arrive in modeling environments. ML models can be trained on massive data sets, data scientists can iterate faster, and MLOps engineers no longer have to worry about how the system will scale.
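To make the data-locality argument concrete, here is a minimal sketch rather than any vendor's API. It uses Python's built-in sqlite3 module purely as a stand-in for an analytical database, and the table and column names (sensor_readings, temperature, power) are invented for illustration. The idea is that a simple least-squares line can be fit from SQL aggregates, so only one row of summary statistics leaves the database instead of every training row.

```python
import sqlite3


def fit_line_in_database(conn, table, x_col, y_col):
    """Fit y = slope * x + intercept using SQL aggregates only.

    The table and column names are interpolated into the query, so they
    must come from trusted code, never from user input.
    """
    # The database scans the rows; only five numbers are returned.
    n, sx, sy, sxx, sxy = conn.execute(
        f"SELECT COUNT(*), SUM({x_col}), SUM({y_col}), "
        f"SUM({x_col} * {x_col}), SUM({x_col} * {y_col}) "
        f"FROM {table}"
    ).fetchone()
    if not n:
        raise ValueError("no rows to train on")
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("x has no variance; slope is undefined")
    # Closed-form ordinary least squares for one predictor.
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Hypothetical data: power = 2 * temperature + 1, for 1,000 readings.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor_readings (temperature REAL, power REAL)")
conn.executemany(
    "INSERT INTO sensor_readings VALUES (?, ?)",
    [(float(t), 2.0 * t + 1.0) for t in range(1000)],
)

slope, intercept = fit_line_in_database(
    conn, "sensor_readings", "temperature", "power"
)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

A production analytical database would expose far richer training functions than this, but the shape is the same: the computation travels to the data, and only the small result travels back.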

2. Higher Performance and Accuracy

Data scientists are often forced by traditional tools to ‘down-sample’, that is, to train models on a subset of the data because of memory, bandwidth, and computational limitations. This leads to biases and inaccuracy in insights and predictions. In machine learning, training on more data will often yield a better model than switching to a better algorithm. When ML models are both trained and used for inference directly in the database, performance is not limited by local compute capacity. Most analytical databases already support massively parallel processing (MPP), high data compression, smart data models, and other performance optimizations. This opens a plethora of possibilities for organizations: they can analyze anything they want and run their analytics at any scale or depth they need.

Businesses can now analyze patterns buried in large data sets, resulting in faster time-to-insight and better predictions.

3. Tighter Security

Most ML models touch sensitive data, which is why security is a critical aspect that should never be ignored. ML models should ideally be secure by design, which means that ML practitioners must integrate security as a core component of ML workflows. When practitioners run ML directly inside the database, they immediately benefit from battle-tested database security practices. Starting with security as a core pillar and then layering ML on top means the models inherit the database’s existing protections instead of needing a separate security layer.

4. Better Governance

One important aspect of MLOps is user management. Who accesses what data? How was it modified? Who deployed which model? All such questions become a simple SQL query in an in-database ML environment. To a database, granting access to a model is the same as granting access to a table. Model governance, repeatability, and explainability are also major aspects. What data was used to train the model in production? Which features strongly influenced the model recommendations? Feature sets can be saved alongside models. Data scientists can run a simple SQL statement or view a graph to verify the live performance of a model. The entire approach makes governance one less thing to worry about.
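As a rough illustration of governance questions becoming SQL queries, the sketch below again uses Python's sqlite3 module as a stand-in for an analytical database. The model_registry table, its columns, and the sample rows are all hypothetical; real platforms keep this metadata in their own system tables. SQLite has no GRANT statement, so access control is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical registry: one row per model version, stored next to the data.
conn.execute(
    """
    CREATE TABLE model_registry (
        model_name     TEXT,
        version        INTEGER,
        training_table TEXT,
        features       TEXT,     -- comma-separated feature names
        deployed_by    TEXT,
        in_production  INTEGER   -- 1 if this version is live, else 0
    )
    """
)
conn.executemany(
    "INSERT INTO model_registry VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("churn_model", 1, "customers_2022", "tenure,plan", "alice", 0),
        ("churn_model", 2, "customers_2023",
         "tenure,plan,support_calls", "bob", 1),
    ],
)


def production_lineage(conn, model_name):
    """Answer: which version is live, what trained it, who deployed it?"""
    row = conn.execute(
        "SELECT version, training_table, features, deployed_by "
        "FROM model_registry "
        "WHERE model_name = ? AND in_production = 1",
        (model_name,),
    ).fetchone()
    if row is None:
        return None  # no production version registered under this name
    version, training_table, features, deployed_by = row
    return {
        "version": version,
        "training_table": training_table,
        "features": features.split(","),
        "deployed_by": deployed_by,
    }


lineage = production_lineage(conn, "churn_model")
print(lineage)
```

Because the feature list is saved in the same row as the model version, the question of what data and features produced the live model is answered by a single lookup.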

5. Greater Democratization of ML

When it comes to analyzing a complex problem, a data scientist is needed. However, certain problems are solvable by other team members. When an expert business analyst has some basic data science knowledge, in-database ML allows them to use SQL to accomplish straightforward regressions and classifications without having to become Python or R coders. Data analysts can leverage in-database ML functionality to create datasets, models, and predictions without requiring specialized programming knowledge. This helps unlock significant value for organizations as data scientists are both scarce and expensive resources.
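The following sketch suggests what a straightforward classification in SQL might look like for an analyst. It is an assumption-laden toy, not a real product's prediction function: sqlite3 stands in for the analytical database, and the customers and churn_model tables, their columns, and the coefficient values are all made up. The model is a plain linear score with a threshold at zero, stored as one row of coefficients and applied to every customer with a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical customer data.
    CREATE TABLE customers (
        customer_id   INTEGER PRIMARY KEY,
        support_calls REAL,
        tenure_years  REAL
    );
    INSERT INTO customers VALUES (1, 5, 1), (2, 0, 6), (3, 4, 2);

    -- Hypothetical trained model, stored as a single row of coefficients.
    CREATE TABLE churn_model (
        intercept       REAL,
        w_support_calls REAL,
        w_tenure_years  REAL
    );
    INSERT INTO churn_model VALUES (-1.0, 0.5, -0.25);
    """
)

# Scoring is ordinary SQL: compute a linear score per customer, then
# label a customer as likely to churn when the score is above zero.
PREDICT_SQL = """
SELECT customer_id,
       score,
       CASE WHEN score > 0 THEN 1 ELSE 0 END AS predicted_churn
FROM (
    SELECT c.customer_id AS customer_id,
           m.intercept
             + m.w_support_calls * c.support_calls
             + m.w_tenure_years  * c.tenure_years AS score
    FROM customers AS c
    CROSS JOIN churn_model AS m
)
ORDER BY customer_id
"""

predictions = conn.execute(PREDICT_SQL).fetchall()
for customer_id, score, predicted_churn in predictions:
    print(customer_id, score, predicted_churn)
```

Nothing here requires Python beyond opening the connection; the prediction logic itself is a query an analyst could run from any SQL client.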

Aside from the long development and deployment cycles of models that do eventually make it into production, studies show that half or fewer of all AI models ever reach production. It is therefore highly recommended that organizations conduct thorough due diligence and get their production data science foundation right, so that ML teams can not only analyze data with speed and at scale but also put their accurate models, insights, and predictions to work for their business.

About the Author

Paige Roberts is Open Source Relations Manager for Vertica, a unified analytics platform that enables predictive business insights based on a scalable architecture. With 25 years of experience in data management and analytics, she has worked as an engineer, trainer, support technician, technical writer, marketer, product manager, and consultant. She has contributed to “97 Things Every Data Engineer Should Know” and co-authored “Accelerate Machine Learning with a Unified Analytics Architecture,” both from O’Reilly publishing. Twitter: @RobertsPaige LinkedIn: https://www.linkedin.com/in/robertspaige/