Test data generator for SAP HANA for large initial...

Sefan_Linders · ‎06-11-2019

Introduction

Recently I ran a scenario on the SAP HANA database for which I needed a large initial data set for a specific table, and another large delta data set consisting of deletions, updates and inserts on the initial data set. I googled something like “data generator for HANA” but didn’t find what I was looking for. Therefore, I wrote a database procedure to fill my table with the data that I needed. Because others might find the script useful too, I extended the script for easy re-use. The result you can find in this blog.

The procedure inserts, updates, or deletes data of a single table, using mostly set-based operations for high-performance. Inserts and updates are based on the field definition of the provided table and contain randomized data. For example, for an integer field, a random integer value is inserted, and for a character-based field, a random string of characters is inserted. The procedure uses the HANA metadata tables to retrieve the table definition. Record updates or deletions are randomized based on the first column of the primary key.

On my little test system, I achieved an insert rate of 9 million records per minute for a table with just a few fields. That was a lot faster than the first version of the script, which was not set-based.

How to run the data generator procedure

Follow below steps to generate data for your table

Dowload the code in this Git repository and copy it into the SQL console

Create a sequence with following statement: "CREATE SEQUENCE GENERATEDATASEQ1". A sequence with that name is used in the procedure.

Call the procedure using the following statement:

CALL "GenerateData" (

  IP_SCHEMA_NAME => '<SCHEMA_NAME>',

  IP_TABLE_NAME => '<TABLE_NAME>',

  IP_INSERT_NUM_RECORDS => <Number of records to be inserted>,

  IP_MAX_STRING_SIZE => <Maximum inserted string size. Strings are automatically limited on field length>,                

  IP_DELETE_NUM_RECORDS => <Number of records to delete>,

  IP_UPDATE_NUM_RECORDS => <Number of records to update>

);

Limitations

Keep in mind the following limitations, or adjust the code to work around them:

Only supports tables with a primary key. The primary key should contain at least one column that can hold integer values. Binary data types are not supported at all.

Created with HANA 2, SPS04, not guaranteed to work with older service packs.

The random generated values make compression difficult. Therefore, you will likely see minimal compression that is not to be compared with real-life use cases.

Example

As an example, a table is created with the following statement:

CREATE SCHEMA GENERATEDATA;

SET SCHEMA GENERATEDATA;

CREATE TABLE T1S (

  A INTEGER, 

  B NVARCHAR(5), 

  C DATE, 

  D TIMESTAMP, 

  PRIMARY KEY (A)

);

To insert two records into that table, the procedure is called as follows:

CALL "GenerateData"( 

  IP_SCHEMA_NAME => 'GENERATEDATA',                                                                

  IP_TABLE_NAME => 'T1S',                                                                

  IP_INSERT_NUM_RECORDS => 2,                                                                

  IP_MAX_STRING_SIZE => 3,                                                                

  IP_DELETE_NUM_RECORDS => 0,                                                                

  IP_UPDATE_NUM_RECORDS => 0

);

A select * on that table yields the following result:

Explanation of the code

In case you want to adjust the code or just understand what happens, an example table definition and procedure call is provided in the header section of the procedure. Throughout the code, output of (parts of) the dynamic SQL statements are shown, so it is easier to understand the, sometimes, long statements.

Troubleshooting

In case you run into an error that contains "incorrect syntax near "FROM"", check if you entered the right schema and table name as parameter for the procedure call.

Conclusion

This blog provided you with a procedure to generate initial and delta data sets. Since the inserts are set-based, it should insert, update and delete at reasonable speeds. There are some limitations, and the code is not perfect, so if you got any improvements, feel free to share them in the comments.

zachi85 · ‎03-27-2020

Hi Stefan,

Thanks for this really helpful post!

I built a simple shell script which executes your procedure 20 times to create 20 tables with 50 million records each with a user I called GENDATA. With the below configuration I got 4.7GB per table in memory and on disk and each run took about 15 minutes (so it took roughly 3 minutes to generate 1GB).

Maybe this is useful for someone else, so I leave it here:

wrapper.sh

n=1
while [ $n -le 20 ]
do
echo “=====Run: $n=====”
hdbsql -U GENDATA “DROP SEQUENCE GenerateDataSeq1;”
hdbsql -U GENDATA “CREATE SEQUENCE GenerateDataSeq1;”
hdbsql -U GENDATA “CREATE TABLE TABLE$n (A INTEGER, B NVARCHAR(20), C DATE, D TIMESTAMP, E NVARCHAR(20), F NVARCHAR(20), PRIMARY KEY (A) );”
hdbsql -U GENDATA “CALL \”GenerateData\” (IP_SCHEMA_NAME => ‘GENDATA’, IP_TABLE_NAME => ‘TABLE$n’, IP_INSERT_NUM_RECORDS => 50000000, IP_MAX_STRING_SIZE => 20, IP_DELETE_NUM_RECORDS => 0, IP_UPDATE_NUM_RECORDS => 0);”
n=$(( n+1 ))
done

staerk · ‎07-30-2020

Hi,

nice work, I can see that this all will work, but the first sentence struck me: Paste it into the SQL Console. I do not have HANA Studio nor HANA Cockpit, where is the SQL Console?

thanks

Thorsten

staerk · ‎07-31-2020

found the answer, if you do not have an SQL Console, you can download the code from github as text file myfile.txt and run it in a sidadm shell like this:

hdbsql -u SYSTEM -p mypassword -I myfile.txt

Great, I created my first stored procedure 🙂

staerk · ‎07-31-2020

Works for me, thanks a lot!

Users who do not have graphical tools can use hdbsql. To query the table, use

hdbsql -u SYSTEM -p mypassword "select * from GENERATEDATA.T1S"|cat

staerk · ‎07-31-2020

works, thanks a ton! To create a user GENDATA you must

create user GENDATA password passW0rd

Then hdbsql -U GENDATA will not allow for queries, but you do not need to use GENDATA to call the script, just its schema must exist. So after creating the user GENDATA, you can use hdbsql -u SYSTEM -p mypassword instead of hdbsql -U GENDATA.

former_member752226 · ‎06-17-2021

Downloaded the code from git hub. While calling Procedure "GenerateData". Hitting the below error.

Could not execute 'CALL "GenerateData"( IP_SCHEMA_NAME => 'DATA1', IP_TABLE_NAME => 'Test', IP_INSERT_NUM_RECORDS => 12, ...' in 149 ms 315 µs .

SAP DBTech JDBC: [257]: sql syntax error: "SYSTEM"."GenerateData": line 153 col 4 (at pos 7942): incorrect syntax near "FROM": line 4 col 6 (at pos 59)

Need help. New to sql, sap hana.

andruha · ‎08-25-2021

@poooja11091992 You probably forgot to create sequence.

former_member66033 · ‎07-29-2022

@Pooja Shetty I got exactly the same error. It's probably worth to know that i'm running HANA 2.0 SPS6 (2.00.060.00.1637829839).

sefan.linders2 Is it possible that the procedure uses a SQL statement here which are not longer valid? It's written using SPS4, while i'm trying to get it running on SPS6

cheers,

Thomas

Sefan_Linders · ‎08-01-2022

In case you run into an error that contains "incorrect syntax near "FROM"", check if you entered the right schema and table name as parameter for the procedure call.

Sefan_Linders · ‎08-01-2022

In case you run into an error that contains "incorrect syntax near "FROM"", check if you entered the right schema and table name as parameter for the procedure call. If the procedure can't find the table that you created, the metadata fetch fails and the resulting INSERT query will be incorrect.

former_member66033 · ‎08-02-2022

Thanks Sefan,

your answer was very helpful. The procedure itself did run using the correct schema - but the tables i've created beforehand didn't.

I fixed that in my database, now it's running perfectly!

former_member688075 · ‎12-23-2022

Hi Sefan,

I ran into the error below while inserting 10,000,000 rows with this procedure:

Could not execute 'CALL "GenerateData"( IP_SCHEMA_NAME => 'SAPABAP1', IP_TABLE_NAME => 'ACDOCA', IP_INSERT_NUM_RECORDS ...' in 6:53.173 minutes .
[314]: numeric overflow: "SYSTEM"."GenerateData": line 153 col 4 (at pos 7947): Failed in "PROZS_PN" column with the value 1000.00

Column PROZS_PN is defined as follows:

"PROZS_PN" DECIMAL(5,2) CS_FIXED DEFAULT 0 NOT NULL

Inserting 100,000 rows didn't raise this issue, but failed with 500,000 and 1,000,000 rows each. Any idea?

BR,
Jinsol

former_member688075 · ‎12-23-2022

Plus, I found an improvement point. If I execute this procedure with a certain schema set on client session, it fails with an error below:

SET SCHEMA SAPABAP1;

CALL "GenerateData" (
IP_SCHEMA_NAME => 'SAPABAP1',
IP_TABLE_NAME => 'ACDOCA',
IP_INSERT_NUM_RECORDS => 200000,
IP_MAX_STRING_SIZE => 2,
IP_DELETE_NUM_RECORDS => 0,
IP_UPDATE_NUM_RECORDS => 0
);

Could not execute 'CALL "GenerateData" ( IP_SCHEMA_NAME => 'SAPABAP1', IP_TABLE_NAME => 'ACDOCA', ...' in 8 ms 349 µs .
SAP DBTech JDBC: [288]: cannot use duplicate table name: "SYSTEM"."GenerateData": line 71 col 3 (at pos 3498): #SURROGATE: line 1 col 37 (at pos 36)

FYI, my testing was done on HANA 2 Rev 64.

BR,
Jinsol

zili_zhou · ‎12-07-2023

Thanks, Sefan! I tested this in HANA Cloud and it also works well. By the way, do we have a way now to control the distinct the records for each column so that I could better simulate the customer's business data compression rate in HANA? Like cost center, versions have normally very few distinct value even we have X millions to X00 millions of records in HANA.