Building Generative AI – Supporting Vector Data (Part One)

By Dom Couldwell, Head of Field Engineering, DataStax

Generative AI represents a significant opportunity for businesses – according to research by Accenture, 98 percent of global executives expect that AI will be essential to their companies’ strategies over the next 3-5 years, and 40 percent of all work time will be affected by generative AI. This level of opportunity will require enterprise architects and infrastructure leaders to think carefully about how they build and deliver these systems to their organisations.

Enterprises have some decisions to make about how to build and run generative AI for themselves. Some of these decisions will be obvious, but others will have long-term implications that will only be judged right or wrong over time. In order to understand these requirements, we have to look at what goes into generative AI, and how those components work together.

LLMs and tokens

Tools like ChatGPT, MidJourney and Stable Diffusion have made generative AI popular. These services have to understand what users mean and then translate those requests into instructions and data that can be used to generate more results.

Large language models, or LLMs, are essential to delivering that natural language experience. They are trained to recognise words and provide responses back. However, they are only part of the equation. Alongside LLMs, you will have to consider how you handle data so that your LLM can use it effectively.

While your data for generative AI can be stored in a database, it is not architected and used in the same way as transactional data or more traditional analytical data. Generative AI is based on tokens: text is split into words or word fragments, and each one is mapped to a specific numerical value that the LLM works with. To parse a sentence, the generative AI system looks at the tokens for each word in the sentence and then responds with what it has modelled as the most appropriate response based on those tokens.
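
As a rough illustration of mapping words to numerical values, here is a minimal sketch in Python. It is a toy: production LLM tokenisers split text into subword units using a learned vocabulary, whereas this version simply assigns an integer ID to each whole word. The function names build_vocab and encode are made up for this example.

```python
# Toy illustration only: real LLM tokenisers use learned subword vocabularies.
# build_vocab and encode are hypothetical helper names, not a library API.

def build_vocab(corpus):
    """Give each distinct lowercase word an integer ID, in order of first appearance."""
    vocab = {}
    for sentence in corpus:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(sentence, vocab, unknown_id=-1):
    """Turn a sentence into a list of token IDs; unseen words map to unknown_id."""
    return [vocab.get(word, unknown_id) for word in sentence.lower().split()]


corpus = ["the red dress", "the blue shirt"]
vocab = build_vocab(corpus)
token_ids = encode("the red shirt", vocab)
print(token_ids)  # [0, 1, 4]
```

The model never sees the words themselves, only sequences of IDs like these.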

Using the tokens in LLMs is appropriate for some use cases. For more complex situations around data, you will want to create more detailed relationships and context around the data that you use. Retraining a model with new data is expensive – Lambda Cloud estimates that it would cost OpenAI $4.6 million to train GPT-3, while the cost to train GPT-4 was more than $100 million according to OpenAI themselves.

Instead, you can look at using vectors to make it easier to search for and recall sets of information with context rather than looking at tokens alone. This uses the same approach to data by assigning mathematical values to items or concepts, but it then links those concepts or data together into vectors.

Vector search and data infrastructure

To prepare this data, we have to go through a process called vectorisation. This creates a new set of values for our data, linking specific characteristics to mathematical representations called embeddings for each item’s semantic meaning. For example, imagine that you are a clothes retailer – your data on sales lines, products and performance will show what sells where, but it will not help you in creating a generative AI service that can communicate with customers. Instead, you will have to create vector embeddings for those products, which can then be stored in a vector database.

Using your product line-up, you can create a mix of different characteristics that you will track – this could include colour, size, style, descriptions, fabrics used, and so on – so that each product has its own set of vectors. This can be applied to different forms of data, from text through to images and videos. These embeddings can then be used by any service to generate responses back to user requests in natural language terms. Using vector search, a customer can ask an AI agent about similar products and get responses back in natural language far faster.
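
To make the idea of turning product characteristics into vectors concrete, here is a minimal sketch. It builds vectors by hand using one-hot encoding, which is an assumption made purely for illustration: in practice, embeddings would come from a trained embedding model and would capture meaning far more richly. The product catalogue and the helper names are hypothetical.

```python
# Hand-built feature vectors for illustration only; a real system would
# obtain embeddings from a trained embedding model. All names are hypothetical.

COLOURS = ["red", "blue", "black"]
STYLES = ["dress", "shirt", "jacket"]
FABRICS = ["cotton", "wool", "silk"]


def one_hot(value, options):
    """Return a list with 1.0 at the position of value and 0.0 elsewhere."""
    return [1.0 if value == option else 0.0 for option in options]


def embed_product(product):
    """Concatenate one-hot encodings of colour, style and fabric into one vector."""
    return (
        one_hot(product["colour"], COLOURS)
        + one_hot(product["style"], STYLES)
        + one_hot(product["fabric"], FABRICS)
    )


products = {
    "red silk dress": {"colour": "red", "style": "dress", "fabric": "silk"},
    "blue cotton shirt": {"colour": "blue", "style": "shirt", "fabric": "cotton"},
}

vectors = {name: embed_product(details) for name, details in products.items()}
for name, vector in vectors.items():
    print(name, vector)
```

Each product ends up as a fixed-length list of numbers, which is the form a vector database stores and compares.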

In essence, every natural language request is converted into an embedding that is then compared to the embeddings stored in the vector database. This search then returns a collection of products that best match your request, based not just on keywords but on semantic meaning. Implementing a vector database will therefore be an essential part of any generative AI installation.
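
The comparison step can be sketched as follows. This example uses cosine similarity, one common way to compare embeddings, and checks the query against every stored vector in turn. The three-dimensional vectors, the product names and the query values are invented for illustration; real embeddings typically have hundreds of dimensions and are produced by an embedding model.

```python
import math

# Illustrative sketch: the vectors below are invented, not model output.


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vector, store, top_k=2):
    """Score every stored vector against the query and return the best names."""
    scored = [(cosine_similarity(query_vector, vec), name) for name, vec in store.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


store = {
    "red silk dress": [0.9, 0.1, 0.8],
    "blue cotton shirt": [0.1, 0.9, 0.2],
    "red wool jacket": [0.8, 0.2, 0.1],
}

# Stand-in for the embedding of a request such as "something red and silky".
query = [1.0, 0.0, 0.7]
results = search(query, store, top_k=2)
print(results)  # ['red silk dress', 'red wool jacket']
```

Checking every vector like this gives exact results but becomes slow as the collection grows, which is why larger systems turn to approximate methods.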

Vector search does not use an exact match approach, as it involves looking for patterns in your embeddings that could be suitable to return based on a prompt. Rather than exact matches, you are looking to deliver approximate nearest neighbour (ANN) results. A common algorithm to use here is a hierarchical navigable small world (HNSW) graph, which makes it easier to find similar vector matches for a given search at speed.
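
The graph-walking idea behind this kind of search can be sketched in a simplified form. The example below is not HNSW itself: it uses a single layer, builds the neighbour graph by brute force, and follows a plain greedy walk, whereas HNSW adds a hierarchy of layers and builds its graph incrementally. The points, the choice of two neighbours per node and the function names are assumptions for illustration.

```python
import math

# Simplified single-layer sketch of greedy search over a neighbour graph.
# This is NOT a full HNSW implementation; names and data are hypothetical.


def build_knn_graph(points, k):
    """Link each point to its k closest other points (built by brute force)."""
    graph = {}
    for i, point in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(point, points[j]))
        graph[i] = others[:k]
    return graph


def greedy_search(points, graph, query, entry=0):
    """Walk from the entry node to whichever neighbour is closest to the query,
    stopping when no neighbour improves on the current node."""
    current = entry
    current_dist = math.dist(points[current], query)
    while True:
        best, best_dist = current, current_dist
        for neighbour in graph[current]:
            d = math.dist(points[neighbour], query)
            if d < best_dist:
                best, best_dist = neighbour, d
        if best == current:
            return current
        current, current_dist = best, best_dist


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
graph = build_knn_graph(points, k=2)
nearest = greedy_search(points, graph, query=(4.2, 0.0))
print(nearest)  # 4
```

The walk only examines the neighbours of the nodes it passes through rather than every point. It can stop at a node that is close to the query but not the closest, which is why the results are described as approximate.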

ANN search based on HNSW is already supported in Apache Lucene, an open source search engine library, which makes it much easier to adopt. By combining Lucene with a database like Apache Cassandra, you can add vector search to your application and manage your data yourself. Alternatively, you can use cloud services that support Lucene and data together. The point to consider is how much data you will create over time and how many embeddings will be needed to represent that data.

Getting to the Goldilocks zone around vectors

When you are looking at vectorisation and data, the temptation is to look at getting as much detail as possible. After all, more vectors will lead to better accuracy and results, right? However, this is not correct – instead, there will be a trade-off around performance.

If you have too few characteristics covered with embeddings, then you run the risk of providing poor results. However, having too many embeddings can actually be worse than having too few, as it will affect the performance that you can deliver within an application. Users will want fast performance from any augmented agent or AI service; even if the results are nigh-on perfect, they won’t wait several seconds for them to arrive. Instead, we have to architect our approach so that we deliver the right balance of accuracy and application performance.
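
A back-of-the-envelope calculation shows why more detail costs more. The sketch below assumes each vector value is stored as a 4-byte floating point number and counts only the raw vectors, ignoring index overhead, compression and replication. The collection size of one million vectors and the dimension counts are illustrative choices, not recommendations.

```python
# Rough sizing sketch under stated assumptions: 4 bytes per value (float32),
# raw vectors only, no index overhead. Function names are hypothetical.


def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Bytes needed to hold the vectors themselves."""
    return num_vectors * dimensions * bytes_per_value


def exhaustive_multiply_adds(num_vectors, dimensions):
    """Multiply-add operations needed to compare one query with every vector."""
    return num_vectors * dimensions


NUM_VECTORS = 1_000_000
for dims in (128, 768, 1536):
    gib = raw_vector_bytes(NUM_VECTORS, dims) / 2**30
    ops = exhaustive_multiply_adds(NUM_VECTORS, dims)
    print(f"{dims} dimensions: {gib:.2f} GiB raw, {ops:,} multiply-adds per exhaustive query")
```

Both storage and comparison work grow in direct proportion to the number of dimensions, so every extra characteristic you track has a price in latency and infrastructure.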

Building generative AI applications and services will involve looking at how to bring together all the right components in one place, as well as whether you want to operate your own LLMs, data and models. For enterprises that want to make use of their own data, using vector databases can help improve results and deliver them at speed where they are needed.

Dom Couldwell is Head of Field Engineering EMEA at DataStax, a real-time data and AI company. Dom helps companies to implement real-time applications based on an open source stack that just works. His previous work includes more than two decades of experience across a variety of verticals including Financial Services, Healthcare and Retail. Prior to DataStax, he worked for the likes of Google, Apigee and Deutsche Bank.