former_member439824
Contributor
In this blog I will show how to import data from flat files into SAP HANA Cloud.
There are several methods you can use, depending on your requirements and the data size.

Overall, I recommend using DBeaver to manually import CSV files up to ~1GB.
Use the "IMPORT FROM" SQL statement to import larger files from Amazon S3, Azure storage, Alibaba Cloud OSS and Google Cloud Storage.
If you need to import files other than CSV or to pre-process your data, Python is the most versatile tool.
If you need to import many files based on a pattern and/or at repeated intervals, use the SAP HANA Smart Data Integration File adapter.

1. hdbtabledata


As explained in this blog, when the data import is tied to modelling tasks in SAP Web IDE Full-Stack or SAP Business Application Studio, you can include small test files (up to 10MB) directly in your database projects. Place the files you want to import in the design-time definition of the HDI container and define the import via an .hdbtabledata HDI artefact.

This approach only supports .csv files.

Be careful when working with Git. Most Git repositories have size limits, and it is not good practice to check large files into a Git repository or to add them to a project in SAP Web IDE.

Find .hdbtabledata loading samples on GitHub.
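For illustration, a minimal .hdbtabledata artefact could look like the sketch below. The target table name, file name and options are placeholders, not taken from this blog; check the SAP HDI documentation for the full set of supported settings.

{
  "format_version": 1,
  "imports": [
    {
      "target_table": "MYAPP_SALES",
      "source_data": {
        "data_type": "CSV",
        "file_name": "myapp_sales.csv",
        "has_header": true,
        "type_config": {
          "delimiter": ","
        }
      }
    }
  ]
}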

2. SAP HANA Database Explorer


Learn more about the HANA Database Explorer through the official tutorial.

You can use the data import function of the SAP HANA Database Explorer to import CSV files up to 1GB as well as ESRI shapefiles from your local PC.
From the Database Explorer, you can also import CSV, PARQUET and Shapefile files from Amazon S3, Azure storage, Alibaba Cloud OSS and Google Cloud Storage.

Access the database explorer from your SAP HANA Cloud instance on the SAP Cloud Platform Cockpit.

Within the Database Explorer, you can access all your HDI containers. Right-click on the database/HDI container into which you want to import data and select "Import Data".

You first need to choose the import type: Data (CSV) or ESRI Shapefile.


If you choose "Import Data", the "Import Data" wizard will open up. You can select the CSV file, then your import target table.

If you want to import files directly from Amazon S3 or Azure, you need to register the IaaS certificate into SAP HANA Cloud first. For details, refer to section '5. "IMPORT FROM" SQL statement' of this blog.


If you choose Amazon S3 as the import source in the UI, you need to set the S3 region and the S3 path, which is made up of your IAM user's access key and secret key as well as your bucket name and object ID. Enter them as requested by the UI.


Select the table where you want to import data.


In the import options, you can choose the file format (CSV, PARQUET), whether to use the first row as column names, the delimiters, and the date format.


Set the error handling, and you can start importing your file!



3. DBeaver


If you use SAP HANA Cloud as a data warehouse, I definitely recommend DBeaver to manage your data. DBeaver has a very intuitive data import function that lets users import data from CSV files.

First, establish a secure connection from DBeaver to SAP HANA Cloud. Then, right-click on the schema where you want to import data and select "Import data".


Follow the wizard to choose the data to import.


In this example I use a 100MB sample dataset from Kaggle with 1,019,925 records.


Set the importer settings.


You can manually set the table mapping or leave the default. The importer can automatically create a new table in SAP HANA Cloud for you.


These are the default columns created for my CSV file.


In the data transfer options, you can decide how often the imported data should be committed. This has an impact on the import time for large files.


Confirm that everything is correct, then start the import.


I was able to import a 100MB CSV file from my client to SAP HANA Cloud in 4 minutes with DBeaver.
For reference, importing the same file into an SAP HANA 2.0 SPS05 instance on my local network takes 12 seconds with DBeaver.

4. Python


If you need to pre-process data before inserting it into SAP HANA Cloud, Python offers a powerful and flexible environment.

I used the same 100MB CSV file as in the DBeaver example to compare import performance.
I started by creating a new Jupyter notebook in the directory where my CSV file is located.


In the notebook:

  • import the hana_ml library,

  • establish the connection to SAP HANA Cloud,

  • read your CSV file with pandas,

  • create a dataframe in the SAP HANA Cloud database containing your data,

  • and finally close the connection.

Here is the Python code I used. The hana_ml dataframe allows developers to work with data in HANA as if it were a pandas dataframe, which means you can prepare the data to fit your needs. Learn more about an actual use case in this blog by Andreas.
!pip install hana_ml

import hana_ml
print(hana_ml.__version__)

import hana_ml.dataframe as dataframe

# Instantiate connection object
conn = dataframe.ConnectionContext(address = 'hostname.hanacloud.ondemand.com',
                                   port = 443,
                                   user = 'user',
                                   password = 'password',
                                   encrypt = 'true')

# Send basic SELECT statement and display the result
sql = 'SELECT 12345 FROM DUMMY'
df_remote = conn.sql(sql)
print(df_remote.collect())

# Read the CSV file into a pandas dataframe
import pandas as pd
df_data = pd.read_csv('NYCFT.csv', sep = ',')
df_data.head(5)

# Upload the pandas dataframe to the NYCFT_PYTHON table in SAP HANA Cloud
df_remote = dataframe.create_dataframe_from_pandas(connection_context = conn,
                                                   pandas_df = df_data,
                                                   table_name = 'NYCFT_PYTHON',
                                                   force = True,
                                                   replace = False)

# Close the connection
conn.close()

The data import finished in 9 minutes. This is consistent with Lars Breddermann's finding that DBeaver is around twice as fast as Python for data imports to SAP HANA Cloud.

I used the hana_ml library to import data the simplest way I could, but you can also use other libraries such as SQLAlchemy, as sketched below.
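For example, here is a minimal sketch using SQLAlchemy and pandas. It assumes the sqlalchemy-hana dialect and the hdbcli driver are installed; the host, credentials and table name are placeholders, not values from this blog.

# Minimal sketch: upload a CSV file via SQLAlchemy instead of hana_ml
# Assumes: pip install sqlalchemy-hana hdbcli (placeholder credentials below)
import pandas as pd
from sqlalchemy import create_engine

# The sqlalchemy-hana dialect registers the 'hana://' URL scheme;
# TLS settings are passed through to the hdbcli driver via connect_args
engine = create_engine('hana://user:password@hostname.hanacloud.ondemand.com:443',
                       connect_args = {'encrypt': 'true'})

# Read the CSV file and write it to a (hypothetical) target table
df_data = pd.read_csv('NYCFT.csv', sep = ',')
df_data.to_sql('NYCFT_SQLALCHEMY', con = engine, if_exists = 'replace',
               index = False, chunksize = 10000)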

5. "IMPORT FROM" SQL statement


Finally, if you have a very large dataset, the fastest way to import data to HANA Cloud is to use the built-in "IMPORT FROM" SQL statement.

Follow the official documentation to enable data import directly from your IaaS platform. For this guide, I used AWS.

First, you need an AWS account with S3 enabled, and your files must be uploaded to S3. I used the same file as in the DBeaver and Python examples to compare performance.

In the IAM section, create a user with API access. You will receive an Access key ID and Secret access key.

Grant the necessary access permissions to your IAM user. I granted the AmazonS3ReadOnlyAccess policy to my user, which gives access to all files on S3 for my account. You can also grant more granular access if necessary.

Now you must register the certificates necessary for the SSL connection to S3.


REST API-based adapters communicate with the endpoint through an HTTP client. To ensure secure communication, the adapters require client certificates from the endpoint services. Connections to an SAP HANA Cloud instance, which are based on the hanaodbc adapter, also require an SSL certificate.




Here are the two certificates necessary to connect to AWS S3:

  • Amazon Root CA

  • DigiCert Baltimore CyberTrust Root CA (download the PEM)


Copy the text of each root certificate and create a certificate from it, which can then be saved in a personal security environment (PSE).
Execute these SQL statements against your SAP HANA Cloud database:

  • Create a PSE.


CREATE PSE HTTPS;


  • Create a certificate.


CREATE CERTIFICATE FROM '
-----BEGIN CERTIFICATE-----
...
-----END CERTIFICATE-----
' COMMENT <comment>;


The created certificate is now contained in the CERTIFICATES system view and can be added to a PSE store.




  • Add the certificate created above to the PSE:


SELECT CERTIFICATE_ID FROM CERTIFICATES WHERE COMMENT = '<comment>'; 
ALTER PSE <pse_name> ADD CERTIFICATE <certificate_id>;


  • Set the purpose of the newly created PSE to REMOTE SOURCE:


SET PSE <pse_name> PURPOSE REMOTE SOURCE;

This makes the certificates in the PSE store available to all remote sources.
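As a quick sanity check (a hedged example, assuming the standard CERTIFICATES and PSE_CERTIFICATES system views), you can verify that the certificates were created and assigned to the PSE:

-- List the imported certificates and check which ones are assigned to the PSE
SELECT CERTIFICATE_ID, COMMENT FROM CERTIFICATES;
SELECT * FROM PSE_CERTIFICATES WHERE PSE_NAME = 'HTTPS';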





Now that the personal security environment is set up, you can import data from AWS S3.


IMPORT FROM CSV FILE 's3-<region>://<access_key>:<secret_key>@<bucket_name>/<object_id>.csv' INTO <TARGET_TABLE_NAME> [with options];
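The optional WITH clause controls parsing and performance. Below is a hedged example with a header row, explicit delimiters, parallel threads and batched commits; the target table name is a placeholder, and the full list of options is described in the IMPORT FROM documentation.

IMPORT FROM CSV FILE 's3-<region>://<access_key>:<secret_key>@<bucket_name>/<object_id>.csv'
INTO MY_SCHEMA.MY_TARGET_TABLE
WITH FIELD DELIMITED BY ','
     OPTIONALLY ENCLOSED BY '"'
     SKIP FIRST 1 ROW
     THREADS 4
     BATCH 10000
     FAIL ON INVALID DATA;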




For my 100MB CSV file, the command took 34 seconds to execute.

Learn how to leverage the IMPORT FROM command from a front-end application in this blog by Naoto Sakai: Import data from Object Store to SAP HANA Cloud.

There are other methods to import data into SAP HANA Cloud which I did not explore, such as the SAP HANA Smart Data Integration File adapter, hdbsql, or the JDBC/ODBC drivers.

The SAP HANA SDI File adapter is particularly useful for these use cases:

  • SharePoint access

  • SharePoint on Office365

  • Pattern-based reading: reading multiple files in a directory that match a user-defined pattern

  • Real-time file replication (APPEND only)


Learn how to Use the SDI FileAdapter to write to Azure file share in Sohit's blog.

In conclusion, each file import method has its merits depending on your use case. Find the method that fits your needs.

Thank you for reading,

Maxime SIMON

Special thanks to Daisuke Ikeguchi for his support in writing this blog.
7 Comments
former_member709877
Discoverer
Thanks for your sharing~~
former_member6443
Contributor
Hi Simon,

Thanks, this is just what I was looking for: S3 integration into HANA Cloud.

 

Regards

Afshin
JonGooding
Advisor
Nice one Max! This will come in handy..

Thanks for writing this up!

Jon
pierre_dominique2
Contributor
Hi Maxime,

Do you know if there's a way to use the IMPORT/EXPORT SQL statements with HDI containers? None of the users I tried (DBADMIN, HDI container admin, HDI container group admin, etc.) has the EXPORT privilege, and I can't find a way to grant it.

Thanks,

Pierre
former_member439824
Contributor
You can find the necessary permissions in the SAP HANA Cloud SQL Reference:

Permissions


You must have the IMPORT system privilege, and the CREATE ANY privilege on the schema that is being imported in.


So you must find the schema where information is stored within your HDI container, and grant the CREATE ANY privilege to the user executing IMPORT. This user must also have the IMPORT system privilege.

In the example below, I created an HDI container called hanacloud. All the schemas shown are created with that HDI container. HANACLOUD_1 is the main schema, so I think you need the CREATE ANY privilege on that one; however, the privilege may also be required for the other technical schemas. Could you test and let us know?

Assign privileges to your user in HANA Cockpit
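For reference, the corresponding GRANT statements could look like this hedged example (the user name is a placeholder; HANACLOUD_1 is the main schema from the container above):

-- Allow the user to run IMPORT and to create objects in the target schema
GRANT IMPORT TO MY_IMPORT_USER;
GRANT CREATE ANY ON SCHEMA HANACLOUD_1 TO MY_IMPORT_USER;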

VoJu
Participant
Hello!

How many records can I upload with the data import function of the SAP HANA Database Explorer? I have a large file and only the first 1000 records are uploaded.

Thanks and best regards,

Julian
arun_sitaraman
Explorer
Thank you very much for this nice post.

The following worked for me. I had to make some changes from the blog post:

  • When doing pip install of hana_ml, I had to use the --user option.

  • I had to use the encoding 'cp1252' and set the number of rows – this is static now and I can make it dynamic.

  • I had to use the schema property instead of trying to prefix the table with the schema name – interestingly, there was no error from create_dataframe_from_pandas; I need to debug this.

Anyway, this works, and we have a baseline to work with.

Best,

Arun


import hana_ml
print(hana_ml.__version__)

import hana_ml.dataframe as dataframe

# Instantiate connection object
conn = dataframe.ConnectionContext(address = '376afaac-b6b2-45a8-a92b-03bea1e341c5.hana.trial-us10.hanacloud.ondemand.com',
                                   port = 443,
                                   user = 'CSE',
                                   password = '<mypassword>',
                                   encrypt = 'true')

# Send basic SELECT statement and display the result
sql = 'SELECT 12345 FROM DUMMY'
df_remote = conn.sql(sql)
print(df_remote.collect())

import pandas as pd
import os

# Check to see if file exists
fpath = "import/Sample.csv"
if os.path.isfile(fpath):
    print("File: ", fpath, " exists!")
else:
    print("File: ", fpath, " does not exist!")

df_data = pd.read_csv(fpath, sep = ',', encoding = 'cp1252')
df_data.head(149785)
print("Number of rows: \n", df_data.head(149785))

df_remote = dataframe.create_dataframe_from_pandas(connection_context = conn,
                                                   pandas_df = df_data,
                                                   table_name = 'TABLEA',
                                                   schema = 'CSE',
                                                   force = True,
                                                   replace = False)

print("Done creating dataframe in HANA CLOUD: ", df_remote.collect())

conn.close()