Practical introduction to SAP HANA Cloud Vector Engine

TiagoAlmeida

Introduction

With the latest update to SAP HANA Cloud, SAP has announced the general availability of the HANA vector engine. Now, HANA DB joins the ranks of other databases that natively support storing, retrieving, and querying vectors.

In this blog post, I will delve into the significance of this development, briefly explore the capabilities of the vector engine, and provide a few examples on how to utilize this functionality. Don't worry, the code examples will be concise and straightforward.

Why is this important?

Large language models (LLMs) are emerging from academia and finding practical applications in the real world. While ChatGPT from OpenAI is the most well-known LLM, there are thousands of others, ranging from academic implementations to state-of-the-art models.

LLMs are algorithms that take text as input and generate text as output. They can answer questions, provide instructions, write code, or even compose emails. However, despite their impressive capabilities, LLMs are not omniscient. Querying them for any information may not always yield a sensible response. To overcome this limitation, we can provide the LLM with relevant information alongside the query, allowing it to refine its output based on the provided context. This technique is known as Retrieval Augmented Generation (RAG). But how do we determine what information to feed the LLM? Our knowledge base may be significantly larger than the input size limit of the LLM's prompt.

This is where features like the HANA Vector Engine can come into play. By storing our knowledge base in HANA Vector Engine, we can retrieve text similar to the query using a similarity search function and then feed those retrievals into the LLM for further processing. In this blog post, we will focus on the storing, retrieving, and performing similarity searches in HANA Cloud DB, not how we convert text into embeddings or break down large documents, that's a topic for other blog posts ;).

HANA Vector Engine

To demonstrate the capabilities of the HANA Vector Engine, let's start by creating a table:

CREATE COLUMN TABLE "DEMO_TOKENS" (
"ID" INTEGER NOT NULL ,
"DESCRIPTION" NVARCHAR(256),
"TOKEN" REAL_VECTOR(300),
PRIMARY KEY ("ID")
)
UNLOAD PRIORITY 5 AUTO MERGE;

What's new in this release is the REAL_VECTOR(N) data type which is a data type storing an N-dimensional vector of single precision floating point numbers.

Inserting into the vector table

Inserting into this table is just like any other insert in SQL. We can use the `TO_REAL_VECTOR` function to convert a text representation of the vector components into a `REAL_VECTOR` prior to insertion. In the example below we insert a record with description `company` and the vector representation of that word as an N-dimensional vector (with N equal to 300 in this case):

INSERT INTO "DEMO_TOKENS" VALUES(
1 /*ID <INTEGER>*/,
'company' /*DESCRIPTION <NVARCHAR(256)>*/,
TO_REAL_VECTOR('[0.005068,-0.023730,-0.062008,-0.053183, ... ,0.000101,-0.010792]')/*TOKEN <REAL_VECTOR>*/);

As an example I have done a few more inserts for more tokens/words in the table.

Searching

When it comes to searching for similar terms using the HANA Vector Engine, we can leverage the power of SQL queries and two essential functions: cosine_similarity and l2distance. These functions allow us to measure the similarity between vectors and retrieve results based on their proximity.

The cosine_similarity function calculates the cosine similarity between two vectors and returns a value between 0 and 1. It would return a 1 for an exact match.

On the other hand, the l2distance function calculates the Euclidean distance between two vectors and returns a value. In this case, a value of 0 represents an exact match, while higher values indicate increasing dissimilarity.

Don't worry too much about the above definitions, this is still trivial to use. Here's three examples:

-- search for "school" - Returns school at the top
SELECT top 1 *
	from DEMO_TOKENS
	order by cosine_similarity("TOKEN",
	TO_REAL_VECTOR('[0.037047,0.011441,0.014958, ... ,0.042594,-0.023278,-0.042991,-0.005077,0.036849,0.048736]')
) desc;

-- search for "university" - Returns school as most similar
SELECT top 1 *
	from DEMO_TOKENS
	order by cosine_similarity("TOKEN",
	TO_REAL_VECTOR('[-0.015841,0.003277,0.041488,... ,-0.071284,-0.017821,-0.031682,0.048277,0.022535]')
) desc;

SELECT *
	from DEMO_TOKENS
	order by l2distance("TOKEN",
	TO_REAL_VECTOR('[-0.015841,0.003277,0.041488, ... ,-0.031682,0.048277,0.022535]')
);

Conclusion

With the SAP HANA Cloud Vector Engine generally available, there is now little need to explore alternative databases for this type of computation. SAP has provided us with a robust solution that can handle the necessary operations, making it a go-to choice for vector-related tasks in BTP.

One aspect that piques my curiosity is its performance characteristics. How well does it scale when faced with millions of tokens to search? This is something I'll aim to explore further.

Until then, thanks for reading and take care.