enes-oerdek
Advisor
Hello World! Recently I worked on the data-science challenges of my engineering team. During this work, I needed a way to persist a Pandas DataFrame (familiar from Jupyter Notebook) in SAP HANA. In this blogpost, I want to share with you how this can be accomplished. The goal is to persist a Pandas DataFrame in SAP HANA using the DataFrame's "to_sql" interface.

Introduction


Jupyter Notebook and R are often the tools of choice for data scientists. They make data operations, data exploration and sharing convenient. SAP HANA is the database of choice for many of the most important businesses around the globe. Connecting the two worlds, Jupyter Notebook and SAP HANA, offers great potential that is waiting to be seized.

SAP HANA



Jürgen Müller already provided a great blogpost about SAP HANA, but I want to briefly describe SAP HANA in my own words as well. SAP HANA is an in-memory database that provides a lot of benefits, especially for analytical use cases. It combines different aspects of databases within one database: besides the typical properties of relational databases, it also delivers properties of NoSQL databases, such as column-oriented storage. Depending on the use case, you can activate or deactivate specific properties to get the best performance out of your system.






SAP HANA is the database for SAP applications. SAP S/4HANA also uses SAP HANA in the background. In SAP Business Technology Platform (BTP), SAP HANA is the database of choice for persisting data, and with the help of tools like CAP CDS it is convenient to do data modelling and build services. Products like SAP Data Warehouse Cloud and SAP Analytics Cloud also use SAP HANA as their database.


Jupyter Notebook



Example of a Notebook


Jupyter Notebook, along with the R programming language, is the tool of choice when it comes to tackling data-science challenges. In Jupyter Notebook, Pandas DataFrames are often used. Jupyter Notebook itself just provides the graphical web interface that makes interactive programming in Python possible; the actual work is done by libraries such as Pandas, NumPy and Matplotlib. Pandas provides the popular "DataFrame" class, which fits the needs data scientists often have in their work. These three Python libraries integrate well with each other and form a powerful toolkit.

DBeaver


I use DBeaver as my graphical user interface for databases. It helps with investigating the raw data and running SQL statements to generate specific outputs. DBeaver is comparable to phpMyAdmin, MySQL Workbench or pgAdmin.



DBeaver also provides connectivity to SAP HANA. Connecting to an SAP HANA database is as simple as selecting the SAP HANA tile and entering your credentials.

Example


Let's start with the demo! I want to introduce the solution as a step-by-step guide that you can follow and try out on your own. If you have an existing SAP HANA instance with credentials, Jupyter Notebook on your computer and DBeaver, you can clone my repository, execute the notebook and follow the same steps as I do here.

Sample Dataset


The dataset I use for this demo is the population data of Mannheim. Mannheim is the city where I am from, and the city provides some of its data in its own Open Data Portal. The dataset is licensed under dl-de/by-2-0, which means it can be used commercially as well as non-commercially, as long as you link to the license text and the data source.


I downloaded and used the CSV source of this dataset.

Load Data as DataFrame


The power of Jupyter Notebook becomes evident when you want to load such datasets. The Pandas library provides very good interfaces to accomplish this: reading data from machine-readable file formats like XML, JSON or CSV is a one-liner.
import pandas as pd
import numpy as np

# Read the semicolon-separated CSV file into a DataFrame
df = pd.read_csv('bevolkerungsbestand-in-mannheim-2009-2020.csv', sep=';')

# Show the first rows to verify the import
df.head()

The first block declares the dependencies. The second block reads the CSV file into a DataFrame. The third block outputs part of the dataset, so that you can check whether everything is correct.


This is how it should look after you have loaded the dataset and executed the commands



Data Processing


Now the data lives in the DataFrame, and data exploration becomes easy. What value can we derive from the data? Pandas DataFrames have interfaces for communicating with other well-known data-science libraries like NumPy and Matplotlib. Merge, split, clean up, connect with other datasets and do machine learning! At this step, creativity is the only limit. For instance, a quick aggregation and plot could look like the sketch below.
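A minimal exploration sketch, assuming hypothetical column names 'Jahr' (year) and 'Anzahl' (count); adapt them to the actual columns of the dataset:
import matplotlib.pyplot as plt

# Hypothetical column names: aggregate the population count per year
population_per_year = df.groupby('Jahr')['Anzahl'].sum()

# Plot the aggregated values as a bar chart
population_per_year.plot(kind='bar')
plt.show()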

Persist Data in SAP HANA


Now comes the exciting part. I have to admit, I experimented for a while until I figured out how it works. SAP HANA has security requirements that necessitate certain connection parameters. That is also the real reason I'm sharing this post: in case others are working on a similar problem, you can use this example to get a quick start.

My requirement was to use the "to_sql" interface provided by DataFrame to persist the dataset. This way you don't have to worry about the data types and you get to the target faster.
# Dependencies: pip install hdbcli sqlalchemy sqlalchemy-hana
from hdbcli import dbapi  # the SAP HANA driver used under the hood
from sqlalchemy import create_engine

# Enter the credentials of your SAP HANA instance here
address = ''
port = '443'
user = ''
password = ''

For seamless integration, the libraries "hdbcli" and "sqlalchemy" are needed; the "hana://" URL scheme used below is provided by the additional "sqlalchemy-hana" dialect package. You can install them with the Python package manager pip. Then enter the credentials of your HANA instance into the corresponding variables: you need the address, port, user and password.
# Build the connection string for the SQLAlchemy hana dialect
connection_string = 'hana://%s:%s@%s:%s' % (user, password, address, port)

# SAP HANA requires an encrypted connection
ssl_args = {'encrypt': 'True', 'sslValidateCertificate': 'False'}

# Careful: this prints the password in clear text
print(connection_string)

The connection does not work without SSL, so the SSL arguments must be passed along with the connection information. In this example, this happens via the sqlalchemy library's connect_args parameter.
# Create the engine and open a connection, passing the SSL arguments
engine = create_engine(connection_string, connect_args=ssl_args)
connection = engine.connect()

# Persist the DataFrame as table "MannheimTest", replacing it if it exists
df.to_sql('MannheimTest', connection, index=False, if_exists='replace')

You create a connection and pass it to the "to_sql" method. sqlalchemy emits a warning message, which can be ignored. That's it! It really is this simple to persist a DataFrame in SAP HANA. If the warning bothers you, you can also suppress it, as sketched below.
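A minimal sketch for silencing the warning with Python's standard warnings module, run before the "to_sql" call; this assumes the warning originates from sqlalchemy modules, and suppressing all of their warnings is a blunt instrument, so use it with care:
import warnings

# Ignore warnings raised from sqlalchemy modules
warnings.filterwarnings('ignore', module='sqlalchemy')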


Check persisted data


Now let us check with DBeaver whether everything was persisted correctly. I open the table under the corresponding connection.


Voilà! This confirms that everything worked fine.
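If you prefer to stay inside the notebook, you can also read the table back with Pandas instead of using DBeaver; a minimal sketch (the quotes around the table name are needed because "to_sql" created a case-sensitive name):
# Read the persisted table back into a DataFrame as a quick check
df_check = pd.read_sql('SELECT * FROM "MannheimTest"', connection)
df_check.head()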

Next Steps


What I have presented here is of course only the prerequisite for many other potential use cases. For fun, I persisted the table created in the example above to the HANA instance of an SAP Data Warehouse Cloud tenant and built a small data pipeline.

The use case I needed this solution for is a data pipeline where I couldn't automate the data source into SAP Data Warehouse Cloud (DWC), because the data sat in multiple Excel sheets and changed regularly. Each time I run the notebook, my local table in SAP DWC is updated, making this a semi-automated solution for formerly local datasets (e.g. in Excel or CSV). This is highly relevant if you are working on a transformation project.


Graphical View in SAP Data Warehouse Cloud


Using the notebook, I merge several Excel files from my file system and persist them in the DWC HANA table, as sketched below. In DWC, I can then perform further operations using Graphical Views or Data Flows.
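A minimal sketch of such a merge step, assuming a hypothetical folder of Excel files and a hypothetical target table name (my actual pipeline differs in the details):
import glob
import pandas as pd

# Hypothetical file pattern; reading .xlsx files requires the openpyxl package
frames = [pd.read_excel(path) for path in glob.glob('data/*.xlsx')]

# Merge all files into one DataFrame and persist it via the existing connection
merged = pd.concat(frames, ignore_index=True)
merged.to_sql('MyLocalTable', connection, index=False, if_exists='replace')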


Story in SAP Analytics Cloud


Finally, I visualized Mannheim's population data in SAP Analytics Cloud using the existing live connectors. This demonstrates a complete data-science use case based on sample data. I don't know yet whether I will write a separate post to go into more detail. But for now, that's it!

Conclusion


The ability to perform notebook operations on SAP HANA with Jupyter Notebook is powerful, and in the context of data science this potential does not seem sufficiently explored to me. Have fun exploring use cases with Jupyter Notebook and SAP HANA!

Further Read



I hope you liked my first blogpost on SAP Blogs. Feel free to give me feedback and ideas on how to improve my content; that is highly appreciated!

Feel encouraged to comment, ask questions and be curious. Did you face similar challenges in your data-science activities? Is this content useful to you? I plan to share more about my data-science activities, so follow me on SAP Blogs for more content like this.
10 Comments
DirkO
Participant
Hi Enes,

many thanks for your nice blog, well explained.

This functionality is a lot of fun. So much time can be saved. APIs as a data source where you do not have to figure out column names and their attributes, awesome. 🙂

I was also thinking about this approach to bring data to DWC. So it would not "harm" a View in DWC or a Story in SAC if the underlying DB table is replaced? Correct?

Br.,

Dirk
yannmiquel
Participant
Hi Enes,

Thanks for the blog. It's nicely explained and will help the community. The pipelines with DWC seem really interesting.

Last time I used this approach (with HANA Cloud/SAC), I reached the conclusion that it's fast, but with a limitation induced by the "to_sql" function: the data types.
This way you don’t have to worry about the data types and you get to the target faster.

The NVARCHAR columns are converted into LOBs and then cannot be used in a Calculation View (for the group by). I preferred to write a bit of code to create/update the table definition based on the DataFrame dtypes (and the maximum length of the fields). This way the strings stay NVARCHAR and can be used in a CV again. A sketch of the idea follows below.
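A minimal sketch of this idea using the dtype parameter of "to_sql" (not my exact code; the column selection logic is simplified):
from sqlalchemy.types import NVARCHAR

# Map string columns to bounded NVARCHAR based on their maximum length,
# so that to_sql does not fall back to LOB
dtype_map = {col: NVARCHAR(int(df[col].astype(str).str.len().max()))
             for col in df.select_dtypes(include='object').columns}

df.to_sql('MannheimTest', connection, index=False, if_exists='replace',
          dtype=dtype_map)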

I think it would be nice to adapt the HANA Python connector to persist the data, with an option to match the DataFrame content exactly, instead of relying on SQLAlchemy and its shortcut to LOBs.

Again, it's a nice post. The aim of my comment is just to highlight the drawback that comes with the speed.

Best regards,

Yann
Markus_Ganser
Participant
Great stuff, thanks Enes!
Vitaliy-R
Developer Advocate

Thank you for sharing your approach using 'hdbcli' and 'sqlalchemy'.


Have you checked the 'hana-ml' package? It provides nice integration with Pandas DataFrames. Here is a recent example of mine: SAP Tech Bytes: CF Python app to upload CSV files into HANA database in SAP HANA Cloud | SAP Blogs
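For reference, a minimal sketch of the equivalent upload with 'hana-ml' (parameter details may differ between package versions; the credential variables are the ones from the blogpost):
from hana_ml.dataframe import ConnectionContext, create_dataframe_from_pandas

# Connect to SAP HANA and upload the Pandas DataFrame as a table;
# force=True drops an existing table with the same name first
cc = ConnectionContext(address=address, port=443, user=user, password=password)
hana_df = create_dataframe_from_pandas(cc, df, 'MannheimTest', force=True)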

enes-oerdek
Advisor
Hey Dirk,

thank you! Yes absolutely. The context of this use case is actually also rapid prototyping where you just want to see results in the shortest amount of time.
I was also thinking about this approach to bring data to DWC. So it would not "harm" a View in DWC or a Story in SAC if the underlying DB table is replaced? Correct?

Very good remark! I just tested it out to confirm. Adding new columns results in no change in DWC. You will see the new columns only after reimporting and deploying the local dataset.

Removing existing columns results in an "invalidated view" error:
invalidated view: MannheimTest: line 11 col 6 (at pos 365), Code: 391, SQL State: HY000

For local tables, it is possible to reimport and deploy them so that the underlying structure in DWC is also updated. The dependent views also need to be redeployed in order to work again.


Kind regards

Enes
enes-oerdek
Advisor
Hey Yann,
Last time I used this approach (with HANA Cloud/SAC), I reached the conclusion that it's fast, but with a limitation induced by the "to_sql" function: the data types.

Yes, you are absolutely right. Actually, in my little prototype in the post, the second step is exactly this data conversion, to make it possible to consume the fields later in SAC.


In my view this is still OK. DWC and SAC are low-code/no-code environments for scaling data-science activities, and in the context of enabling non-developers for these activities, that is a value in itself. But yes, it would be even better if we could automatically choose the right data types for persisting.
I think it would be nice to adapt the HANA Python connector to persist the data, with an option to match the DataFrame content exactly, instead of relying on SQLAlchemy and its shortcut to LOBs.

Absolutely! That would be even better.

Kind regards

Enes
enes-oerdek
Advisor
Thank you Markus!
enes-oerdek
Advisor
Hey Witalij,

thanks a lot for providing this article. It is a really interesting package, and the approach in that blogpost is indeed very similar to the one here. Thanks for writing it! I also want to link to this article:

https://blogs.sap.com/2020/03/11/quickly-load-covid-19-data-with-hana_ml-and-see-with-dbeaver/

I also want to check out the approach you provided.

Kind regards

Enes
yannmiquel
Participant
Thanks for the complement. With your article I discovered these DWH pipeline capabilities, very interesting!
sushantp123
Explorer
This is awesome. I appreciate this approach and would love to see more such use cases of integrating Python with DWC. 🙂