SAP Machine Learning Embedding in OpenAI
Sergiu
Contributor
Author: iatco.sergiu

"Have you ever wished to ask an AI agent questions about selected thousands of pages after you have woken up from a good sleep?" 🙂

I wrote the blog “Hello, world!” your crafted chat GPT bot! about how to use the OpenAI API to submit completions and ask follow-up questions. The questions are restricted to the content OpenAI was trained on, and that is not always enough, as we want to extend the capabilities with up-to-date online content or local content.

Asking questions about additional content requires data augmentation. Let's see what is possible nowadays with the OpenAI API and LlamaIndex.

Table of contents


Simple content embedding
SAP Machine Learning Embedding in OpenAI
  Collect HTML from URLs
  Collect Notebooks with git
  Collect HTML from Notebooks
  Collect HTML to TXT
  SAP HANA Machine Learning Challenge Embedding in OpenAI
Conclusions

Simple content embedding


I have created a class llama_context() with methods to prepare the structure of folders required for LlamaIndex, estimate the costs, create a vector index, start the query engine, and ask questions about content embedded in OpenAI. The entire code with links to resources is on GitHub 00 Simple content embedding.
class llama_context():
    def __init__(self, path=None):
        # code
    def load_data(self):
        self.documents = SimpleDirectoryReader(self.data_dir).load_data()
        print(f"Documents loaded: {len(self.documents)}.")
    def create_vector_store(self):
        self.index = GPTVectorStoreIndex.from_documents(self.documents)
        print("GPTVectorStoreIndex complete.")
    def save_index(self):
        self.index.storage_context.persist(persist_dir=self.persist_dir)
        print(f"Index saved in path {self.persist_dir}.")
    def load_index(self):
        storage_context = StorageContext.from_defaults(persist_dir=self.persist_dir)
        self.index = load_index_from_storage(storage_context)
    def start_query_engine(self):
        self.query_engine = self.index.as_query_engine()
        print("Query_engine started.")
    def post_question(self, question, sleep=None):
        # code
    def del_data_dir(self):
        # code
    def copy_file_to_data_dir(self, file_extension='.txt', verbose=0):
        # code
    def copy_path_from_to_data_dir(self, path_from, file_extension='.txt', verbose=0):
        # code
    def estimate_tokens(self, text):
        # code
    def estimate_cost(self):
        # code
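The method bodies marked # code are elided above; as an illustration only, here is a minimal sketch of how post_question() could look, assuming the query_engine attribute created by start_query_engine() and the response attribute used in the examples below:

import time

# Hedged sketch of the elided post_question() body (not the original code):
# query the engine, keep the response, and optionally pause between calls.
def post_question(self, question, sleep=None):
    if sleep:
        time.sleep(sleep)  # throttle requests to respect rate limits
    self.response = self.query_engine.query(question)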

# Create lct object from class llama_context() with the working path
path_llama = "llama_mvp"
lct = llama_context(path=path_llama)

 
# Delete data directory
lct.del_data_dir()

 
# Copy files from source to data directory
path_from = "llama_mvp/source"
lct.copy_path_from_to_data_dir(path_from) # default extension *.txt

The folder contains the file mvp.txt with the content "Bogdan was born in 1990".
# Load documents
# Content "Bogdan was born in 1990"
lct.load_data()

 
# Creating the vector index performs the embedding and costs tokens
lct.create_vector_store()

# Out:
# INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
# > [build_index_from_nodes] Total LLM token usage: 0 tokens
# INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 7 tokens
# > [build_index_from_nodes] Total embedding token usage: 7 tokens
# GPTVectorStoreIndex complete.

 
# Save index
lct.save_index()

 
# Method load_index() restores the saved index instead of re-running create_vector_store(),
# so you don't need to upload and embed the data every time
# The index is the content knowledge
lct.load_index()

Starting the query engine with the content knowledge stored in the vector index. 🧠
# Start query engine
lct.start_query_engine()

We are ready to ask questions. 🧐
question = "What is content about?"
lct.post_question(question)
print(lct.response)
# Out:
# The content is about Bogdan and the year he was born.

question = "How old is he?"
# Out:
# Bogdan is 30 years old.

question = "What date is today?"
# Out:
# Today's date is August 8, 2020.

from datetime import date
today = date.today()
question = f"Consider current date {today}"
# Out:
# Consider current date 2023-05-15
# Bogdan is 33 years old.
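The date hint and the age question can also be folded into a single prompt; a sketch following the same pattern as above (the combined question is my illustration, not from the original run):

from datetime import date

# Prepend the current date so the model can compute the age itself.
today = date.today()
question = f"Consider current date {today}. How old is Bogdan?"
lct.post_question(question)
print(lct.response)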


question = "Where is the name commonly used as a given name?"
# Out:
# The name Bogdan is commonly used as a given name in Eastern European countries such as Romania, Bulgaria, and Ukraine.

 

As you can see, the AI agent is aware of the augmented content, of the updated current date, and of further public content about countries.

SAP Machine Learning Embedding in OpenAI


I participated in the SAP HANA ML Challenge – Employee Churn 🤖 and I came in second place 🏆.

Now it is time to move on to the SAP Machine Learning Embedding of the Challenge in OpenAI. The laborious part is collecting, storing, and converting data from various sources and formats.

In this experiment, I will collect data from URLs in HTML format and from GitHub notebooks in IPYNB format, then convert the data to raw TXT format. 🧪⚗️💎

Collect HTML from URLs


Collected URLs:


Blogs:
https://blogs.sap.com/2022/11/07/sap-community-call-sap-hana-cloud-machine-learning-challenge-i-quit...
https://blogs.sap.com/2022/11/28/i-quit-how-to-predict-employee-churn-sap-hana-cloud-machine-learnin...
https://blogs.sap.com/2022/12/22/sap-hana-cloud-machine-learning-challenge-2022-the-winners-are/

https://blogs.sap.com/2023/01/09/sap-hana-cloud-machine-learning-challenge-i-quit-understanding-metr...

Documentation:
https://help.sap.com/doc/1d0ebfe5e8dd44d09606814d83308d4b/2.0.04/en-US/hana_ml.dataframe.html
https://help.sap.com/doc/1d0ebfe5e8dd44d09606814d83308d4b/2.0.07/en-US/pal/algorithms/hana_ml.algori...

Collection of HTML from URLs is performed with the class collect_html(). The entire code with links to resources is on GitHub 01 Collect html from urls.
import urllib.request
import os

class collect_html():
    def __init__(self):
        pass
    def read_save_html(self, url, path_save=None, filename=None, mode=0):
        # mode: 0 - save, 1 - content, 2 - save and content
        response = urllib.request.urlopen(url)
        html_file = response.read()
        # code
        if mode == 1 or mode == 2:
            return html_file

# code

 
# List html files
repo_path = path_save
list_ipynb(repo_path, "html")

# URLs content is stored in files:
# llama_challenge\html_challenge\understanding_metrics_blog.html
# llama_challenge\html_challenge\challenge_20221107.html
# llama_challenge\html_challenge\challenge_20221128.html
# llama_challenge\html_challenge\challenge_20221222.html
# llama_challenge\html_challenge\hana_ml.dataframe.html
# llama_challenge\html_challenge\hana_ml.algorithms.pal.trees.HybridGradientBoostingClassifier.html
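The helper list_ipynb() used above is not shown in the snippets; a minimal sketch, assuming it simply walks a folder tree and prints the files with a given extension:

import os

# Hypothetical sketch of the list_ipynb() helper used in the examples:
# walk a folder tree and print every file with the given extension.
def list_ipynb(repo_path, extension):
    for root, _dirs, files in os.walk(repo_path):
        for name in files:
            if name.lower().endswith("." + extension):
                print(os.path.join(root, name))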

 

Collect Notebooks with git


Collected repositories from GitHub:

https://github.com/itsergiu/sapcommunity-hanaml-challenge
https://github.com/SAP-samples/hana-ml-samples

You have to install Git to run it in notebooks. The entire code with links to resources is on GitHub 02 Collect notebooks ipynb.
# https://github.com/SAP-samples/hana-ml-samples
# https://github.com/SAP-samples/hana-ml-samples/tree/main/Python-API/usecase-examples/sapcommunity-ha...
# folder = "Python-API\usecase-examples\sapcommunity-hanaml-challenge"

REPO_URL = "https://github.com/itsergiu/sapcommunity-hanaml-challenge"
DOCS_FOLDER = "llama_challenge/ipynb_blog"
!git clone $REPO_URL $DOCS_FOLDER

REPO_URL = "https://github.com/SAP-samples/hana-ml-samples"
DOCS_FOLDER = "llama_challenge/ipynb_hana_ml_samples"
!git clone $REPO_URL $DOCS_FOLDER
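The ! syntax only works inside a notebook; in a plain Python script, the same clone could be done with subprocess (a sketch, not part of the original code):

import subprocess

# Equivalent of `!git clone $REPO_URL $DOCS_FOLDER` outside a notebook.
subprocess.run(["git", "clone", REPO_URL, DOCS_FOLDER], check=True)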

 
repo_path = "ipynb_blog"
list_ipynb(repo_path, "ipynb")
# ipynb_blog\SAP HANA ML challendge - CHURN v2.3 max.ipynb

 
repo_path = "ipynb_hana_ml_samples/Python-API/usecase-examples/sapcommunity-hanaml-challenge"
list_ipynb(repo_path, "ipynb")
# ipynb_hana_ml_samples\Python-API\usecase-examples\sapcommunity-hanaml-challenge\10 Connectivity Check.ipynb
# ipynb_hana_ml_samples\Python-API\usecase-examples\sapcommunity-hanaml-challenge\20 Data upload.ipynb
# ipynb_hana_ml_samples\Python-API\usecase-examples\sapcommunity-hanaml-challenge\PAL Tutorial - Unified Classification Hybrid Gradient Boosting - PredictiveQuality Example.ipynb
# ipynb_hana_ml_samples\Python-API\usecase-examples\sapcommunity-hanaml-challenge\Upload and explore Employee Churn data.ipynb

 
repo_path = "ipynb_blog"
list_ipynb(repo_path, "ipynb")
# ipynb_blog\SAP HANA ML challendge - CHURN v2.3 max.ipynb

Collect HTML from Notebooks


LlamaIndex provides many data connectors in LlamaHub.AI to load documents from different formats; for instance, file-ipynb could be used for notebooks and file-unstructured for HTML, as sketched below.
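For reference, a hedged sketch of that route, assuming the download_loader API of the llama_index version used at the time (the reader name and signature are assumptions and vary between versions):

from pathlib import Path
from llama_index import download_loader

# Load a notebook through the LlamaHub file-ipynb connector (assumed API).
IPYNBReader = download_loader("IPYNBReader")
documents = IPYNBReader().load_data(file=Path("notebook.ipynb"))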

I will use custom conversion. Collection (conversion) of Notebooks into HTML is performed with the class collect_ipynb(). The entire code is on GitHub 03 Collect html from notebook ipynb.
class collect_ipynb():
    def __init__(self):
        pass

    def ipynb_to_html(self, ipynb_file, path_save=None, encoding=None, content=False, verbose=0):
        # verbose: 0 - Completion, 1 - Source & Destination
        # code
    def ipynb_path_to_html(self, repo_path=None, path_save=None, encoding=None, verbose=0):
        # verbose: 0 - Complete message | 1 - Source file & Saved file
        # code
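The conversion step itself is elided; a minimal sketch of its core, assuming nbconvert's HTMLExporter (a plausible building block, not necessarily the original implementation):

from nbconvert import HTMLExporter

# Convert one notebook to HTML with nbconvert; the custom class
# presumably adds path handling and encoding options around this.
exporter = HTMLExporter()
body, _resources = exporter.from_filename("notebook.ipynb")
with open("notebook.html", "w", encoding="utf-8") as f:
    f.write(body)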

 
# Converted notebooks are stored in files:

# Out:
# llama_challenge\ipynb_hana_ml_samples\Python-API\usecase-examples\sapcommunity-hanaml-challenge\10 Connectivity Check.html
# llama_challenge\ipynb_hana_ml_samples\Python-API\usecase-examples\sapcommunity-hanaml-challenge\20 Data upload.html
# llama_challenge\ipynb_hana_ml_samples\Python-API\usecase-examples\sapcommunity-hanaml-challenge\PAL Tutorial - Unified Classification Hybrid Gradient Boosting - PredictiveQuality Example.html
# llama_challenge\ipynb_hana_ml_samples\Python-API\usecase-examples\sapcommunity-hanaml-challenge\Upload and explore Employee Churn data.html

# Out:
# llama_challenge\ipynb_blog\SAP HANA ML challendge - CHURN v2.3 max.html

Collect HTML to TXT


Collection (conversion) of HTML into TXT is performed with the class collect_text().

The entire code is on GitHub 04 Collect html to txt.
class collect_text():
    def __init__(self, mask_ext=None):
        # code
    def open_html(self, html_file, encoding_read=None):
        # code
    def html_to_text(self, html_content):
        # code
    def html_to_text_file(self, html_file, path_save=None, content=False, verbose=0,
                          encoding_read=None, encoding_write=None):
        # code
    def html_path_to_text(self, repo_path=None, path_save=None, encoding_read=None,
                          encoding_write=None, verbose=0):
        # code

 
# Converted files from HTML into TXT are stored in the same location:
# Out:
# llama_challenge\html_challenge\understanding_metrics_blog.txt
# llama_challenge\html_challenge\challenge_20221107.txt
# llama_challenge\html_challenge\challenge_20221128.txt
# llama_challenge\html_challenge\challenge_20221222.txt
# llama_challenge\html_challenge\hana_ml.dataframe.txt
# llama_challenge\html_challenge\hana_ml.algorithms.pal.trees.HybridGradientBoostingClassifier.txt
# Out:
# llama_challenge\ipynb_hana_ml_samples\Python-API\usecase-examples\sapcommunity-hanaml-challenge\readme.txt
# llama_challenge\ipynb_hana_ml_samples\Python-API\usecase-examples\sapcommunity-hanaml-challenge\10 Connectivity Check.txt
# llama_challenge\ipynb_hana_ml_samples\Python-API\usecase-examples\sapcommunity-hanaml-challenge\20 Data upload.txt
# llama_challenge\ipynb_hana_ml_samples\Python-API\usecase-examples\sapcommunity-hanaml-challenge\PAL Tutorial - Unified Classification Hybrid Gradient Boosting - PredictiveQuality Example.txt
# llama_challenge\ipynb_hana_ml_samples\Python-API\usecase-examples\sapcommunity-hanaml-challenge\Upload and explore Employee Churn data.txt
# Out:
# llama_challenge\ipynb_blog\SAP HANA ML challendge - CHURN v2.3 max.txt

SAP Machine Learning Challenge Embedding in OpenAI


In previous steps, content has been collected and converted to text files. Now we can use the same class llama_context() used for simple content to load data, create the index, start the engine, and ask questions.

The entire code is on GitHub 05 SAP HANA Machine Learning content embedding.

Defining the working folder and displaying the folders of the object lct created from the class llama_context().
# lct = llama_context(path='llama')
path_llama = "llama_challenge"
lct = llama_context(path=path_llama)

display(lct.path)
display(lct.data_dir)
display(lct.persist_dir)
# Out:
# 'llama_challenge'
# 'llama_challenge\\data'
# 'llama_challenge\\storage'

Specifying the paths for content.
path_from1 = "llama_challenge/html_challenge"
path_from2 = "llama_challenge/ipynb_blog"
path_from3 = "llama_challenge/ipynb_hana_ml_samples/Python-API/usecase-examples/sapcommunity-hanaml-challenge"

lct.copy_path_from_to_data_dir(path_from1) # default extension *.txt
lct.copy_path_from_to_data_dir(path_from2) # default extension *.txt
lct.copy_path_from_to_data_dir(path_from3) # default extension *.txt

Converted files in text format are saved in the same folders as the source files.
# Converted files into TXT are saved in folders:
# html_challenge
# ipynb_blog
# ipynb_hana_ml_samples

 

Loading the data from the TXT files converted from the initial HTML and IPYNB formats.
lct.load_data()
# Out:
# Documents loaded: 12.

Estimating the minimum and maximum costs for tokens.
lct.estimate_cost()
# Out:
# Total estimated costs with model ada: $0.0175276
# Total estimated costs with model davinci: $1.31457
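The estimate_tokens() and estimate_cost() methods are elided in the class; a hedged sketch of the idea, counting tokens with tiktoken and multiplying by per-1K-token prices (the prices below are illustrative assumptions, not current pricing):

import tiktoken

# Illustrative per-1K-token prices (assumptions, not actual pricing).
PRICES = {"ada": 0.0004, "davinci": 0.03}

def estimate_cost(text):
    # Count tokens, then scale by each model's price per 1K tokens.
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = len(encoding.encode(text))
    for model, price_per_1k in PRICES.items():
        print(f"Total estimated costs with model {model}: ${tokens / 1000 * price_per_1k}")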

Ready to create the vector index with OpenAI! 🧠
lct.create_vector_store()
# API key is required. Embedding costs tokens!
# https://platform.openai.com/account/api-keys

# Out:
# Total embedding token usage: 147741 tokens
# GPTVectorStoreIndex complete.

# https://platform.openai.com/account/usage
# Usage - $0.35
# text-embedding-ada-002-v2, 24 requests
# 103,950 prompt + 0 completion = 103,950 tokens

Saving the index for later use.
lct.save_index()
# Out:
# Index saved in path llama_challenge\storage.

We could continue with the index already in memory in the object lct; however, for the purpose of the example, the index is loaded from the file saved before.
lct.load_index()
# API key is required. Loading and embedding cost tokens!
# https://platform.openai.com/account/api-keys
# Out:
# Loading all indices.

Starting query engine! Ready to ask questions!  🤖
lct.start_query_engine()

 

Asking general questions about SAP HANA Machine Learning Challenge - "I quit!". 🧐
question = "What is content about?"
lct.post_question(question)
print(lct.response)

# Out:
The content is about SAP HANA and its related technologies, such as SAP HANA Cloud's Auto ML
capabilities, SAP HANA Python Client API for Machine Learning Algorithms, and SAP HANA Predictive
Analysis Library (PAL). It also includes information about a book related to SAP HANA and a blog post
about SAP HANA Machine Learning with ABAP Managed Database Procedures in SAP BW/4HANA.

question = "Who organized the Community Call?"
# Out:
The SAP HANA Cloud Machine Learning Challenge team organized the Community Call.

question = "What problem participants must solve?"
# Out:
Participants must solve the problem of predicting employee churn.

question = "Explain data for predicting employee churn"
# Out:
Data for predicting employee churn can include information about the employee such as their job title, years of experience, salary, performance reviews, and other factors that may influence their decision to stay or leave the company. Additionally, data can be collected from the company itself, such as the onboarding process, company culture, learning opportunities, and other factors that may influence employee churn. By analyzing this data, patterns can be identified that can help predict employee churn and inform decisions about how to improve the company's retention rate.

question = "Can you tell me which machine learning models were used by the participants?"
# Out:
The participants used the HybridGradientBoostingTree model for their machine learning.

question = "Which are the top 5 important features discoverd by the model?"
# Out:
The top 5 important features discovered by the model are: SICKDAYS, HRTRAINING, PREVIOUS_CAREER_PATH, LINKEDIN, and FUNCTIONALAREACHANGETYPE.

 

Asking questions for HANA Machine Learning code generation. 🧐
# QUESTION 1
question = "Python full code SAP HANA Machine learning HGBT example"
# Out:
The following code is an example of using the SAP HANA Python Client API for Machine Learning Algorithms to implement a HGBT (Hierarchical Gradient Boosting Tree) model.

# Import the necessary libraries
import hana_ml
from hana_ml.algorithms.apl.hgbt import HGBT

# Create a connection to the SAP HANA system
connection_context = hana_ml.dataframe.ConnectionContext(address='<hostname>:<port>',
user='<username>',
password='<password>')

# Load the data into a dataframe
df = connection_context.table('<schema>.<table>')

# Create the HGBT model
hgbt = HGBT(conn_context=connection_context)

# Fit the model
hgbt.fit(data=df, key='<key_column>', label='<label_column>')

# Make predictions
predictions = hgbt.predict(data=df)

# Evaluate the model
hgbt.evaluate(data=df, label='

 
# QUESTION 2
question = "Python full code hana_ml dataframe example"
# Out:
# Import the necessary libraries
import hana_ml
import pandas as pd

# Load the CSV file into a Python object (Pandas DataFrame)
df_data = pd.read_csv(r'Emp_Churn_Train.csv', sep = ',')

# Create a connection to the HANA system
connection_context = hana_ml.dataframe.ConnectionContext(address='<HANA_SYSTEM_ADDRESS>', port=<HANA_SYSTEM_PORT>, user='<HANA_SYSTEM_USER>', password='<HANA_SYSTEM_PASSWORD>')

# Create a dataframe object from the Pandas DataFrame
df_remote = connection_context.table('EMP_CHURN_TRAIN', schema='<HANA_SYSTEM_SCHEMA>', data=df_data)

# Create training and testing set
from hana_ml.algorithms.pal import partition
hdf_train, hdf_test, hdf_val = partition.train_test_val_split( random_seed = 1017

Execute the notebook in your browser with Binder.


Execute the notebook 05 SAP HANA Machine Learning content embedding v1.3.1.ipynb. It uses the already embedded data in vector_store; tokens are consumed only for the questions.


The notebook 05 SAP HANA Machine Learning content embedding v1.3.ipynb is the full version, with the document collection from the previous steps and the creation of vector_store. Creation of vector_store costs tokens.



Conclusions


The results are good and promising. The response_mode is the default and the data are raw text, without further cleaning and preprocessing. As expected, Content Embedding handles general information better than technical coding, which is of higher complexity. Most probably, results would be similar with other LLMs, and fine-tuning would improve them only marginally. Sizeable improvements, however, require a Data-Centric machine learning approach performed by Data Scientists: improving data quality, gathering more data, or engineering domain-specific content. What would it take in terms of resources and time to embed all the content of blogs.sap.com and GitHub SAP? 😉😊😪🤔

 

Enjoy! 🎈🎈🎈

 

Blog series


SAP HANA Cloud Machine Learning Challenge “I quit!” – Understanding metrics

Could machine learning build a model for prime numbers?

“Hello, world!” your crafted chat GPT bot!

SAP Machine Learning Embedding in OpenAI

Building Trust in AI: YouTube Transcript OpenAI Assistant

AI Challenge | Web scraping with Generative AI

 
7 Comments
ChristophMorgen
Product and Topic Expert
Great and very interesting blog, Sergiu!
ChristophMorgen
Product and Topic Expert
Was there a specific reason you used Llamaindex over langchain?

BR, Christoph
Sergiu
Contributor

Thanks for your appreciation and question, christoph.morgen. Initially, I started with Langchain, but I ran into the 4096 token limits of OpenAI model GPT-3.5. Looking for a solution, I found LlamaIndex, which is built on top of Langchain. LlamaIndex is widely documented and evolving fast. I discovered the Llama Hub Plugins and built the online app YouTube Transcript OpenAI Assistant, as well as the blog Building Trust in AI: YouTube Transcript OpenAI Assistant as a continuation of the series of blogs for the SAP HANA Cloud Machine Learning Challenge. Best regards, Sergiu.

marcus_schiffer
Active Participant
Hi,

is there a way to use the llama sql embedding with HANA db data similar to this example https://gpt-index.readthedocs.io/en/v0.5.27/guides/tutorials/sql_guide.html ?

Any help welcome !

Regards

Marcus
Sergiu
Contributor

Hi,

I was thinking about that one day, and the workaround I thought of was to connect to the HANA db, extract the data, and transfer it to a SQL database supported by LlamaIndex.

Regards,

Sergiu

marcus_schiffer
Active Participant
Thanks for the response !

Another way could be to load the data from a hana_ml dataframe into a pandas dataframe and create an in-memory db in Python. That worked for me. With that, I can ask about e.g. our purchase orders through a natural language interface.
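A minimal sketch of that idea, assuming a hana_ml ConnectionContext and a table name (the credentials and table name are placeholders, not from the original comment):

import sqlite3
import hana_ml

# Connect to HANA and reference the remote table (placeholder credentials).
connection_context = hana_ml.dataframe.ConnectionContext(
    address="<host>", port=443, user="<user>", password="<password>")
hdf_remote = connection_context.table("PURCHASE_ORDERS")

# Pull the data into pandas, then materialize it in an in-memory
# SQLite database that a SQL-capable index can query.
df_pandas = hdf_remote.collect()
conn = sqlite3.connect(":memory:")
df_pandas.to_sql("PURCHASE_ORDERS", conn, index=False)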
Sergiu
Contributor

Great! That's the idea with Python in between. I guess the next step would be sales invoices.