
This is part of the HANA Data Strategy series of BLOGS


https://blogs.sap.com/2019/10/14/sap-hana-data-strategy/

 

Overview


Recently, as customers move larger and larger tables from HANA into HANA Data Lake, I am often asked what the fastest way is to move data from HANA to HANA Data Lake.  Or, more precisely, I am asked whether there is a faster way than doing a simple HANA INSERT into a HANA Data Lake virtual table.

You may be asking why customers are moving large tables from HANA to HANA Data Lake (HDL).  The most popular use cases are the initial materialization of a large dataset and archiving older data to HDL.  Most of these customers use HANA Smart Data Integration (SDI) to do this materialization, and they often use the same interface for change data capture, using SDI Flowgraphs or SDI real-time replication to keep these tables up to date.

 

See these BLOGs for more detail on HANA SDI:

HANA Data Strategy: Data Ingestion including Real-Time Change Data Capture

https://blogs.sap.com/2020/06/18/hana-data-strategy-data-ingestion-including-real-time-change-data-c...

HANA Data Strategy: Data Ingestion – Virtualization

https://blogs.sap.com/2020/03/09/hana-data-strategy-data-ingestion-virtualization/

 

The three simple methods of moving data that we will examine here are:

    • a simple HANA INSERT into an HDL virtual table (sketched just below),

    • a simple HDL INSERT from HDL accessing a HANA virtual table, and

    • a HANA export and HDL LOAD.
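
For reference, the first method is driven from the HANA side through a virtual table over the HDL target table. The following is only a rough sketch: it assumes a HANA Cloud managed data lake exposed through the remote source SYSRDL#CG and an HDL target table ORDERSPULL that already exists; the remote source, schema, and object names are assumptions and must be adjusted to your landscape.

-- HANA Cloud side (sketch): create a virtual table over the HDL table, then insert into it.
-- The remote source and schema names below are illustrative, not taken from this test setup.
CREATE VIRTUAL TABLE TPCD.ORDERSPULL_VT
  AT "SYSRDL#CG"."<NULL>"."SYSRDL#CG"."ORDERSPULL";

INSERT INTO TPCD.ORDERSPULL_VT SELECT * FROM TPCD.ORDERS;
COMMIT;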



 

Now you might be asking:

“Why go through HANA?”

“Why not load the data directly into HDL?”

Again, these customers are using the HANA Enterprise Information Management tools, which currently require a HANA object (local or virtual) as a target.  In future BLOGs we’ll discuss loading data directly into HDL via IQ Client-Side Load, Data Services, and Data Intelligence.

The fastest way to load data from HANA Cloud, HANA into HANA Cloud, HANA Data Lake is to do it from HDL/IQ: run an INSERT statement with a SELECT from a HANA table, using “create existing local temporary table” to create a proxy table pointing to the HANA physical table (see details below).

Method                              Rows          Data Size    Time (seconds)
HDL/IQ INSERT..SELECT               28,565,809    3.3 GB       52.86
*HDL/IQ LOAD (Azure File System)    28,565,809    3.3 GB       116 (1m 56s)
*HDL/IQ LOAD (HDL File System)      28,565,809    3.3 GB       510 (8m 30s)
HANA INSERT..SELECT                 28,565,809    3.3 GB       1,277 (21m 17s)


* Does not include the time to export the data from HANA to the file system

The test used a TPC-D ORDERS table with 28,565,809 rows, which is about 3.3 GB of data, loaded into a small HDL configuration.

 

This was tested using this HANA Cloud configuration:


HANA 60GB/200GB 4 vCPU

HDL 16TB 8vCPU worker/8vCPU coordinator

In the HDL configuration, more vCPUs would have allowed this to run with more parallelism (especially on wider tables), and adding more TB of storage gives HDL more disk I/O throughput.

 

Detailed configuration notes and syntax used for testing


 

Start HANA Cockpit to manage SAP HANA Cloud.


 

From Data Lake … choose Open in SAP HANA Database Explorer

You may be prompted for your HDLADMIN password if this is the first time going here.


 

Enter the SQL command and click the Run icon to execute it.


 

HDL commands for creating:

    1. a server connecting to HANA Cloud from HDL,

    2. a local HDL table to load the data into, and

    3. a local temporary proxy table pointing to the table in the HANA Cloud instance



 

CREATE SERVER

--DROP SERVER DRHHC2_HDB

CREATE SERVER DRHHC2_HDB CLASS 'HANAODBC' USING 'Driver=libodbcHDB.so;ConnectTimeout=60000;ServerNode=XXXX.hana.prod-us10.hanacloud.ondemand.com:443;ENCRYPT=TRUE;ssltruststore=XXXX.hana.prod-us10.hanacloud.ondemand.com;ssltrustcert=Yes;UID=DBADMIN;PWD=XXXXX;'
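
Optionally, before creating the proxy tables, you can check that the new server definition reaches HANA by listing its remote tables with the SQL Anywhere/IQ system procedure sp_remote_tables (a quick sanity check; the server name matches the one created above):

-- Optional connectivity check: list the tables visible through the remote server
CALL sp_remote_tables('DRHHC2_HDB');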

 

CREATE TARGET TABLE

CREATE TABLE REGIONPULL (
    R_REGIONKEY    bigint        not null,
    R_NAME         varchar(25)   not null,
    R_COMMENT      varchar(152)  not null,
    primary key (R_REGIONKEY)
);

 

CREATE local temporary PROXY

create existing local temporary table REGION_PROXY (
    R_REGIONKEY    bigint        not null,
    R_NAME         varchar(25)   not null,
    R_COMMENT      varchar(152)  not null,
    primary key (R_REGIONKEY)
)
at 'DRHHC2_HDB..TPCD.REGION';

 

INSERT DATA

INSERT into REGIONPULL SELECT * from REGION_PROXY;

Commit;

--1.9s

 

ORDERS table test commands


--DROP TABLE ORDERSPULL;

create table ORDERSPULL (

O_ORDERKEY           BIGINT               not null,

O_CUSTKEY            BIGINT               not null,

O_ORDERSTATUS        VARCHAR(1)           not null,

O_TOTALPRICE         DECIMAL(12,2)        not null,

O_ORDERDATE          DATE                 not null,

O_ORDERPRIORITY      VARCHAR(15)          not null,

O_CLERK              VARCHAR(15)          not null,

O_SHIPPRIORITY       INTEGER              not null,

O_COMMENT            VARCHAR(79)          not null,

primary key (O_ORDERKEY)

);

 

create existing local temporary table ORDERS_PROXY (

O_ORDERKEY           BIGINT               not null,

O_CUSTKEY            BIGINT               not null,

O_ORDERSTATUS        VARCHAR(1)           not null,

O_TOTALPRICE         DECIMAL(12,2)        not null,

O_ORDERDATE          DATE                 not null,

O_ORDERPRIORITY      VARCHAR(15)          not null,

O_CLERK              VARCHAR(15)          not null,

O_SHIPPRIORITY       INTEGER              not null,

O_COMMENT            VARCHAR(79)          not null

)

at 'DRHHC2_HDB..TPCD.ORDERS';

 

INSERT into ORDERSPULL SELECT * from ORDERS_PROXY;

Commit;

--59s

--52.86 s

 

SELECT COUNT(*) FROM ORDERSPULL;

--28,565,809

 

LINEITEM table test commands


create table LINEITEM (

L_ORDERKEY           BIGINT               not null,

L_PARTKEY            BIGINT               not null,

L_SUPPKEY            BIGINT               not null,

L_LINENUMBER         INTEGER              not null,

L_QUANTITY           DECIMAL(12,2)        not null,

L_EXTENDEDPRICE      DECIMAL(12,2)        not null,

L_DISCOUNT           DECIMAL(12,2)        not null,

L_TAX                DECIMAL(12,2)        not null,

L_RETURNFLAG         VARCHAR(1)              not null,

L_LINESTATUS         VARCHAR(1)              not null,

L_SHIPDATE           DATE                 not null,

L_COMMITDATE         DATE                 not null,

L_RECEIPTDATE        DATE                 not null,

L_SHIPINSTRUCT       VARCHAR(25)          not null,

L_SHIPMODE           VARCHAR(10)          not null,

L_COMMENT            VARCHAR(44)          not null,

primary key (L_ORDERKEY,L_LINENUMBER)

);

 

create existing local temporary table LINEITEM_PROXY (

L_ORDERKEY           BIGINT               not null,

L_PARTKEY            BIGINT               not null,

L_SUPPKEY            BIGINT               not null,

L_LINENUMBER         INTEGER              not null,

L_QUANTITY           DECIMAL(12,2)        not null,

L_EXTENDEDPRICE      DECIMAL(12,2)        not null,

L_DISCOUNT           DECIMAL(12,2)        not null,

L_TAX                DECIMAL(12,2)        not null,

L_RETURNFLAG         VARCHAR(1)              not null,

L_LINESTATUS         VARCHAR(1)              not null,

L_SHIPDATE           DATE                 not null,

L_COMMITDATE         DATE                 not null,

L_RECEIPTDATE        DATE                 not null,

L_SHIPINSTRUCT       VARCHAR(25)          not null,

L_SHIPMODE           VARCHAR(10)          not null,

L_COMMENT            VARCHAR(44)          not null

)

at 'DRHHC2_HDB..TPCD.LINEITEM';

 

INSERT into LINEITEM SELECT * from LINEITEM_PROXY;

Commit;

-- Rows affected:       114,129,863

-- Client elapsed time: 4 m 52 s

 

In Conclusion


The fastest way to load data from HANA Cloud, HANA into HANA Cloud, HANA Data Lake is to do it from HDL/IQ, running an INSERT statement with a SELECT from a proxy table created with “create existing local temporary table” that points to the HANA physical table.  This can very easily be done using the commands listed in this blog, or even more easily by creating a procedure that generates these commands (see Daniel’s BLOG below).

 

Also see:


Jason Hansberger’s Loading Data into SAP HANA Cloud, Data Lake BLOG goes into detail on how raising HDL vCPUs and database size affects load performance:

https://blogs.sap.com/2020/11/23/loading-data-into-sap-hana-cloud-data-lake/

Daniel Utvich’s Move data FAST from an SAP HANA Cloud database to a HANA Data Lake BLOG has an example of a procedure that will generate this SQL code for you based on system table information:

https://blogs.sap.com/2022/01/14/move-data-fast-from-an-sap-hana-cloud-database-to-a-hana-data-lake/

 

SAP HANA Data Strategy BLOGs Index


SAP HANA Data Strategy

 

 

 



 

9 Comments
Great post!

Your blog is really helpful for me.

Is there any way to insert much faster with options like fetching HANA Cloud DATA with parallel processing or inserting into Data Lake with partitioning?
In a word, no, not without raising the configuration values on the HANA side or the HDL side.  The INSERT using a local temporary table is already doing a parallel fetch from HANA (which here is only a 4 vCPU configuration; a larger HANA config may run faster).  You do not partition an IQ database; it will not help, because IQ loads already stripe across all of the files in the DBSpace.  Again, adding more vCPUs would help.  See:

Jason Hansberger’s Loading Data into SAP HANA Cloud, Data Lake BLOG goes into detail on how raising HDL vCPUs and database size affects load performance:

https://blogs.sap.com/2020/11/23/loading-data-into-sap-hana-cloud-data-lake/

I am currently doing some testing with HANA Cloud 2022 QRC1, which allows HANA to export to HDLFS in a compressed Parquet format, to see if export/import can run quicker.

fun fun fun

Doug

Thanks for the quick and detailed reply!


One more question.


I faced the following error message when I tested it in our environment.


Could not execute 'insert into ZSY_GSEG_D_SEG_SELRU select * from ZSY_GSEG_D_SEG_SELRU_proxy2'
Error: (dberror) [-1013134]: Error accessing server 'JP10_DWC_PRESALES_I027845': [SAP AG][LIBODBCHDB SO][HDBODBC] General error;403 internal error: Error opening the cursor for the remote database [SAP AG][LIBODBCHDB SO][HDBODBC] General error;4 cannot allocate enough mem
-- (xtolib/xtcontext.cxx 359)


The statement "Select * from ZSY_GSEG_D_SEG_SELRU_proxy2" works fine. It shows 1,000 results as a preview.


But when I run the statement "insert into ZSY_GSEG_D_SEG_SELRU select * from ZSY_GSEG_D_SEG_SELRU_proxy2;", it shows an error after about 5 minutes.


What should I check to solve this issue?


Youngseol

HANA is an in-memory database, so when you do a select * from a table, it uncompresses the data and converts it to a binary format that it sends to IQ.  I am guessing there are significantly more than 1,000 rows in this table.  How much memory is in HANA?  How big is the table?  The quick answer is to break this into multiple select statements, like select * from table where year = ..., one for each year/month/quarter of data (see the sketch below).
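
For example, with the ORDERS proxy table used earlier in this blog, a chunked version of the insert could look like the following sketch (the date ranges are illustrative; size them so each chunk fits comfortably in HANA's workspace):

-- Move the data in date-range chunks so each remote SELECT stays small in HANA memory
INSERT into ORDERSPULL SELECT * from ORDERS_PROXY
  WHERE O_ORDERDATE >= '1995-01-01' AND O_ORDERDATE < '1996-01-01';
Commit;

INSERT into ORDERSPULL SELECT * from ORDERS_PROXY
  WHERE O_ORDERDATE >= '1996-01-01' AND O_ORDERDATE < '1997-01-01';
Commit;

-- ...repeat for the remaining years (or months/quarters for very large tables)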
markmumy (Advisor)
When HDLRE/IQ parses the statement to pull data from the proxy table, it attempts to parallelize the fetch process so that performance can be improved.  Each parallel thread logs into HANA to pull a subset of the data.  During that process, the data has to be ordered so that each thread can pull a known subset of data.  In HANA, the ORDER BY clause is causing the result set to be materialized in HANA rather than just sent back to IQ.  This is the process that is causing the out of memory condition in your case.

You can increase the amount of RAM that HANA has allocated to it so that the data no longer exhausts all RAM (it can be shrunk when you are done via a support ticket) or you can disable the parallelism.

When moving the data, you want to set the HDLRE query parallelism option to 1 rather than the default of 64.  This will force HDLRE to open a single connection to HANA but also play nice with the HANA memory and not exhaust it.




  • set temporary option Max_Query_Parallelism = 1;

  • <data movement SQL>

  • set temporary option Max_Query_Parallelism =;


It is possible that the parallelism could be something higher than 1, perhaps 5 or 10 or 20.  In order to determine what the maximum setting could be for your data, you would have to try various values.  Also, as the number of rows in HANA changes, so would this value.  In short, as the amount of HANA data increases, we want to decrease this value to avoid an OOM issue.
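
Putting that together with the ORDERS example from this blog, the full sequence would look like this (the option name and the reset form are as given above; the last statement returns Max_Query_Parallelism to its default):

-- Limit the proxy-table fetch to a single HANA connection for this session
set temporary option Max_Query_Parallelism = 1;

INSERT into ORDERSPULL SELECT * from ORDERS_PROXY;
Commit;

-- Reset the option back to its default
set temporary option Max_Query_Parallelism =;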
songlin620 (Advisor)
Nice Blog! Will the above performance test result apply to mass/large amount of data load? If not, which is the fastest way to load mass data into HANA Cloud, Data Lake?
songlin620 (Advisor)
I got an Error: (dberror) [-308]: Connection was terminated, when I execute the Insert...Select... statement. Any idea?
Yes, this is perfect for moving mass/large amounts of data, but we all need to remember that HANA Cloud's TEMP/workspace memory is limited.  A SELECT * in HANA will pull all of the data out of that nicely compressed columnar table and blow it up to raw data before sending it to HDL/IQ.  So, as with all batch processing done in HANA, we have to limit the amount of data that we are asking for so that it will fit into HANA's workspace: use many SELECT * FROM ... WHERE date BETWEEN X AND Y statements, or some other WHERE clause, to limit the HANA memory being used.

Alternatively, you can EXPORT the file from HANA Cloud to HDL Files and then use an IQ LOAD command to load the data extremely fast from the file (use Parquet format). This requires a bit more syntax and knowledge of the IQ LOAD command (a rough sketch follows).
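
A minimal sketch of that load side, assuming the data was already exported from HANA as a comma-delimited file into the data lake file container (the hdlfs path and the load options are illustrative; a Parquet file would use the corresponding FORMAT option described in the HDL LOAD TABLE documentation):

-- HDL/IQ side (sketch): bulk load a previously exported delimited file from HDL Files
LOAD TABLE ORDERSPULL (
  O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS, O_TOTALPRICE, O_ORDERDATE,
  O_ORDERPRIORITY, O_CLERK, O_SHIPPRIORITY, O_COMMENT
)
USING FILE 'hdlfs:///export/orders.csv'
DELIMITED BY ','
ROW DELIMITED BY '\n'
QUOTES OFF
ESCAPES OFF;
Commit;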
0 Kudos
FYI

Special characters, like a period (.) in a table name, require the semicolon (;) delimiter syntax:

 

Example: if the HANA table name is "DOUG.REGION", you cannot use double quotes in the AT location string, but you can use semicolons as the delimiter instead of periods:

 

create existing local temporary table REGION_PROXY (
    R_REGIONKEY    bigint        not null,
    R_NAME         varchar(25)   not null,
    R_COMMENT      varchar(152)  not null,
    primary key (R_REGIONKEY)
)
at 'DRHHC2_HDB;;TPCD;DOUG.REGION';