The following are the objectives in this blog:

  • Create an embedding
  • Store the embedding in a vector database
  • Read embeddings: understand cosine similarity via the dot product and Euclidean distance

Overview of the setup and flow of the blog: create an embedding with the OpenAI API, store it in a SingleStore vector database, and read it back with similarity queries.

I. Create Embeddings

Before we create an embedding, let's understand what an embedding is and what models are available.

Embeddings are numerical representations of concepts converted to number sequences, which make it easy for computers to understand the relationships between those concepts.


Some of the models that are available with OpenAI are:

Operations | Models | Use Cases
--- | --- | ---
Text similarity: captures semantic similarity between pieces of text | text-similarity-{ada, babbage, curie, davinci}-001 | Clustering, regression, anomaly detection, visualization
Text search: semantic information retrieval over documents | text-search-{ada, babbage, curie, davinci}-{query, doc}-001 | Search, context relevance, information retrieval
Code search: find relevant code with a query in natural language | code-search-{ada, babbage}-{code, text}-001 | Code search and relevance

The setup to create an embedding from Visual Studio Code:

Step 1: Create an account at https://openai.com/

Step 2: Select API

Step 3: Create an API Key


Step 4: Install Visual Studio Code from https://code.visualstudio.com/download


Step 5: Install the Postman extension in Visual Studio Code and set up the API key


Step 6: Set up and execute the API to create the embedding
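If you prefer code to the Postman screens, here is a minimal Python sketch of the same call. The API key and input text are placeholders; the endpoint and payload follow OpenAI's embeddings API:

import requests

API_KEY = "sk-..."  # placeholder: the key created in Step 3

def get_embedding(text, model="text-embedding-ada-002"):
    # POST /v1/embeddings returns one embedding per input text
    resp = requests.post(
        "https://api.openai.com/v1/embeddings",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": model, "input": text},
    )
    resp.raise_for_status()
    return resp.json()["data"][0]["embedding"]

vector = get_embedding("Hello world")
print(len(vector))  # text-embedding-ada-002 returns 1536 dimensions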


The embedding is returned as below, consuming 4 tokens:

{
    "object": "list",
    "data": [
        {
            "object": "embedding",
            "index": 0,
            "embedding": [
                0.006846516,
                -0.017028954,
                -0.033345513,
                -0.03129053.........
            ]
        }
    ],
    "model": "text-embedding-ada-002",
    "usage": {
        "prompt_tokens": 4,
        "total_tokens": 4
    }
}


II. Store the Embedding

Embeddings need a special type of database, called a vector database, to store the vectors. I have used SingleStore; there are many other vector databases, such as Pinecone, and SAP has just announced its own SAP HANA Cloud vector engine. We will subscribe to SingleStore and understand some of its components, then create a database and a table and store the embedding.

The setup to store an embedding in the SingleStore vector database:

Step 1: Sign up for SingleStore


Step 2: Attach your workspace and objects. By default, SingleStore suspends all objects after 20 minutes of inactivity, so you must attach (activate) them before performing any operations.


Step 3: Hierarchy and objects for storing the embedding


Step 4: Insert the text and its embedding into the vector database with the SQL editor from the development pane, as sketched below
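A hedged sketch of the equivalent DDL and insert from Python. SingleStore speaks the MySQL wire protocol, so pymysql works; the host is a placeholder, and the table layout is a plausible guess at what the SQL editor screenshots show:

import json
import pymysql  # SingleStore is MySQL wire-compatible

conn = pymysql.connect(
    host="svc-xxxx.svc.singlestore.com",  # placeholder workspace host
    user="admin",
    password="...",
    database="Openaidatabase",
)

with conn.cursor() as cur:
    # A BLOB column holds the packed vector
    cur.execute(
        "CREATE TABLE IF NOT EXISTS Myvectortable (text VARCHAR(255), vector BLOB)"
    )
    embedding = get_embedding("Hello world")  # helper from Step 6 above
    # JSON_ARRAY_PACK converts a JSON array string into SingleStore's binary vector format
    cur.execute(
        "INSERT INTO Myvectortable VALUES (%s, JSON_ARRAY_PACK(%s))",
        ("Hello world", json.dumps(embedding)),
    )
conn.commit()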


Congratulations! Now we can perform various operations on the vectors. Explore whether King − Man + Woman = Queen, as sketched below.
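As a sketch of that experiment (reusing get_embedding from Step 6; with OpenAI embeddings the analogy tends to hold only loosely, so treat this as exploration rather than a guarantee):

import numpy as np

words = ["king", "man", "woman", "queen"]
king, man, woman, queen = (np.array(get_embedding(w)) for w in words)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

candidate = king - man + woman
print(cosine(candidate, queen))  # is the shifted vector closest to "queen"?
print(cosine(candidate, king))   # compare against the other words for context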

III. Reading the embedding

Unlike in a relational database, where we search for an exact match on a value such as an employee PERNR or a material MATNR, with embeddings we search for the similarity between vectors, because vectors with similar meanings cluster together within the vector database.

To compute the similarity between vectors, we need to understand a few mathematical measures. We will look at two that are widely used:

Cosine similarity / dot product: the cosine of the angle formed by the two vectors at the origin defines the similarity score.


Assume there are two phrases: P(1) with two words, “Hello World”, and P(2) with one word, “Hello”.

Step 1: Tabulate the number of times each word occurs in the phrases.

Step 2: Plot the first point, x, for P(1) at x(1,1).

Step 3: Plot the second point, y, for P(2) at y(0,1).

Step 4: Extend lines from the origin (0,0) through the points x and y.

Step 5: Measure the angle formed by these rays, which is 45°.

Step 6: The cosine of the angle, 0.71, is the similarity score.

Note: if the vectors are multidimensional, the general formula is used to compute the similarity value: cos(θ) = (A · B) / (‖A‖ ‖B‖), that is, the dot product of the vectors divided by the product of their magnitudes.
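A quick Python check of the worked example, using only the standard library:

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Word counts: "Hello World" -> (1, 1), "Hello" -> (0, 1)
print(round(cosine_similarity([1, 1], [0, 1]), 2))  # 0.71, the cosine of 45°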

Euclidean distance: the distance between the vector endpoints defines the similarity score (the smaller the distance, the more similar the vectors).


Assume the same two phrases: P(1) with two words, “Hello World”, and P(2) with one word, “Hello”.

Step 1: Tabulate the number of times each word occurs in the phrases.

Step 2: Plot the first point, x, for P(1) at x(1,1).

Step 3: Plot the second point, y, for P(2) at y(0,1).

Step 4: Measure the distance between the two points with the Euclidean formula: d = √((x₁ − y₁)² + (x₂ − y₂)²).
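And the matching check for the Euclidean distance (math.dist implements the formula above):

import math

# Distance between the count vectors for "Hello World" and "Hello"
print(math.dist([1, 1], [0, 1]))  # 1.0; a smaller distance means more similar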

Now that you understand the concepts behind the similarity value, let's set up the system.

Step 1: The table Myvectortable in the Openaidatabase has been populated with embeddings for sample texts, including “Hello world” and “Bye Bye”.

Step 2: Compute the similarity values for the texts “Hi” and “Cya” using the dot_product approach.

The words “Hi” and “Cya” are not stored in Myvectortable. Their embeddings are computed from the OpenAI API, as in the embedding-creation step, and each embedding is sent as a parameter in a query that reads from Myvectortable:
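A hedged sketch of such a query, reusing conn from Section II and get_embedding from Step 6 (DOT_PRODUCT and JSON_ARRAY_PACK are SingleStore built-ins):

import json

for word in ["Hi", "Cya"]:
    query_vec = json.dumps(get_embedding(word))
    with conn.cursor() as cur:
        # Rank every stored text by its dot product with the query vector
        cur.execute(
            "SELECT text, DOT_PRODUCT(vector, JSON_ARRAY_PACK(%s)) AS similarity "
            "FROM Myvectortable ORDER BY similarity DESC",
            (query_vec,),
        )
        print(word, cur.fetchall())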


The results for both queries: the embedding for “Hi” returns a higher similarity value for “Hello world”, while the embedding for “Cya” returns a higher similarity value for “Bye Bye”.

Conclusion

“I am still learning.”—Michelangelo

I hope this blog taps into your curious mind and encourages you to be part of the AI journey. In the bigger SAP picture, this blog falls within the scope of the SAP HANA Cloud vector engine.

(Diagram: SAP's big picture for the vector engine. Source: SAP)

Great inspiration from the blogs below:

https://community.sap.com/t5/technology-blogs-by-sap/which-embedding-model-should-i-use-with-my-corp...

https://community.sap.com/t5/technology-blogs-by-sap/lets-add-2-custom-embedding-models-to-sap-ai-co...

https://community.sap.com/t5/artificial-intelligence-and-machine-learning-blogs/sap-machine-learning...

https://community.sap.com/t5/technology-blogs-by-sap/share-corporate-info-with-an-llm-using-embeddin...

https://community.sap.com/t5/technology-blogs-by-sap/demystifying-transformers-and-embeddings-some-g...

https://community.sap.com/t5/application-development-blog-posts/editing-json-embedded-in-abap-string...

https://community.sap.com/t5/technology-blogs-by-sap/generative-ai-some-thoughts-on-using-embeddings...

https://community.sap.com/t5/technology-blogs-by-sap/how-sap-s-generative-ai-hub-facilitates-embedde...
