Technology Blogs by SAP
Learn how to extend and personalize SAP applications. Follow the SAP technology blog for insights into SAP BTP, ABAP, SAP Analytics Cloud, SAP HANA, and more.
cancel
Showing results for 
Search instead for 
Did you mean: 
Sefan_Linders
Product and Topic Expert
Product and Topic Expert

Introduction


Recently I ran a scenario on the SAP HANA database for which I needed a large initial data set for a specific table, and another large delta data set consisting of deletions, updates and inserts on the initial data set. I googled something like “data generator for HANA” but didn’t find what I was looking for. Therefore, I wrote a database procedure to fill my table with the data that I needed. Because others might find the script useful too, I extended the script for easy re-use. The result you can find in this blog.

The procedure inserts, updates, or deletes data of a single table, using mostly set-based operations for high-performance. Inserts and updates are based on the field definition of the provided table and contain randomized data. For example, for an integer field, a random integer value is inserted, and for a character-based field, a random string of characters is inserted. The procedure uses the HANA metadata tables to retrieve the table definition. Record updates or deletions are randomized based on the first column of the primary key.

On my little test system, I achieved an insert rate of 9 million records per minute for a table with just a few fields. That was a lot faster than the first version of the script, which was not set-based.

How to run the data generator procedure


Follow below steps to generate data for your table

  1. Dowload the code in this Git repository and copy it into the SQL console

  2. Create a sequence with following statement: "CREATE SEQUENCE GENERATEDATASEQ1". A sequence with that name is used in the procedure.

  3. Call the procedure using the following statement:


CALL "GenerateData" (
IP_SCHEMA_NAME => '<SCHEMA_NAME>',
IP_TABLE_NAME => '<TABLE_NAME>',
IP_INSERT_NUM_RECORDS => <Number of records to be inserted>,
IP_MAX_STRING_SIZE => <Maximum inserted string size. Strings are automatically limited on field length>,
IP_DELETE_NUM_RECORDS => <Number of records to delete>,
IP_UPDATE_NUM_RECORDS => <Number of records to update>
);

Limitations


Keep in mind the following limitations, or adjust the code to work around them:

  • Only supports tables with a primary key. The primary key should contain at least one column that can hold integer values. Binary data types are not supported at all.

  • Created with HANA 2, SPS04, not guaranteed to work with older service packs.

  • The random generated values make compression difficult. Therefore, you will likely see minimal compression that is not to be compared with real-life use cases.


Example


As an example, a table is created with the following statement:
CREATE SCHEMA GENERATEDATA;
SET SCHEMA GENERATEDATA;
CREATE TABLE T1S (
A INTEGER,
B NVARCHAR(5),
C DATE,
D TIMESTAMP,
PRIMARY KEY (A)
);

To insert two records into that table, the procedure is called as follows:
CALL "GenerateData"( 
IP_SCHEMA_NAME => 'GENERATEDATA',
IP_TABLE_NAME => 'T1S',
IP_INSERT_NUM_RECORDS => 2,
IP_MAX_STRING_SIZE => 3,
IP_DELETE_NUM_RECORDS => 0,
IP_UPDATE_NUM_RECORDS => 0
);

A select * on that table yields the following result:


Explanation of the code


In case you want to adjust the code or just understand what happens, an example table definition and procedure call is provided in the header section of the procedure. Throughout the code, output of (parts of) the dynamic SQL statements are shown, so it is easier to understand the, sometimes, long statements.

Troubleshooting


In case you run into an error that contains "incorrect syntax near "FROM"", check if you entered the right schema and table name as parameter for the procedure call.

Conclusion


This blog provided you with a procedure to generate initial and delta data sets. Since the inserts are set-based, it should insert, update and delete at reasonable speeds. There are some limitations, and the code is not perfect, so if you got any improvements, feel free to share them in the comments.
14 Comments
zachi85
Employee
Employee

Hi Stefan,

Thanks for this really helpful post!

I built a simple shell script which executes your procedure 20 times to create 20 tables with 50 million records each with a user I called GENDATA. With the below configuration I got 4.7GB per table in memory and on disk and each run took about 15 minutes (so it took roughly 3 minutes to generate 1GB).

Maybe this is useful for someone else, so I leave it here:

wrapper.sh

n=1
while [ $n -le 20 ]
do
echo “=====Run: $n=====”
hdbsql -U GENDATA “DROP SEQUENCE GenerateDataSeq1;”
hdbsql -U GENDATA “CREATE SEQUENCE GenerateDataSeq1;”
hdbsql -U GENDATA “CREATE TABLE TABLE$n (A INTEGER, B NVARCHAR(20), C DATE, D TIMESTAMP, E NVARCHAR(20), F NVARCHAR(20), PRIMARY KEY (A) );”
hdbsql -U GENDATA “CALL \”GenerateData\” (IP_SCHEMA_NAME => ‘GENDATA’, IP_TABLE_NAME => ‘TABLE$n’, IP_INSERT_NUM_RECORDS => 50000000, IP_MAX_STRING_SIZE => 20, IP_DELETE_NUM_RECORDS => 0, IP_UPDATE_NUM_RECORDS => 0);”
n=$(( n+1 ))
done

 

 

staerk
Participant
0 Kudos
Hi,

nice work, I can see that this all will work, but the first sentence struck me: Paste it into the SQL Console. I do not have HANA Studio nor HANA Cockpit, where is the SQL Console?

thanks

Thorsten
staerk
Participant
found the answer, if you do not have an SQL Console, you can download the code from github as text file myfile.txt and run it in a sidadm shell like this:

hdbsql -u SYSTEM -p mypassword -I myfile.txt

 

Great, I created my first stored procedure 🙂
staerk
Participant
Works for me, thanks a lot!

Users who do not have graphical tools can use hdbsql. To query the table, use

hdbsql -u SYSTEM -p mypassword "select * from GENERATEDATA.T1S"|cat
staerk
Participant
0 Kudos
works, thanks a ton! To create a user GENDATA you must

create user GENDATA password passW0rd

Then hdbsql -U GENDATA will not allow for queries, but you do not need to use GENDATA to call the script, just its schema must exist. So after creating the user GENDATA, you can use hdbsql -u SYSTEM -p mypassword instead of hdbsql -U GENDATA.

 
0 Kudos
Downloaded the code from git hub. While calling Procedure "GenerateData". Hitting the below error.

Could not execute 'CALL "GenerateData"( IP_SCHEMA_NAME => 'DATA1', IP_TABLE_NAME => 'Test', IP_INSERT_NUM_RECORDS => 12, ...' in 149 ms 315 µs . 


SAP DBTech JDBC: [257]: sql syntax error: "SYSTEM"."GenerateData": line 153 col 4 (at pos 7942): incorrect syntax near "FROM": line 4 col 6 (at pos 59)


Need help. New to sql, sap hana.
andruha
Discoverer
0 Kudos
@poooja11091992 You probably forgot to create sequence.
former_member66033
Discoverer
0 Kudos
@Pooja Shetty I got exactly the same error. It's probably worth to know that i'm running HANA 2.0 SPS6 (2.00.060.00.1637829839).

sefan.linders2 Is it possible that the procedure uses a SQL statement here which are not longer valid? It's written using SPS4, while i'm trying to get it running on SPS6

 

cheers,

Thomas

 
Sefan_Linders
Product and Topic Expert
Product and Topic Expert
0 Kudos

In case you run into an error that contains "incorrect syntax near "FROM"", check if you entered the right schema and table name as parameter for the procedure call.

Sefan_Linders
Product and Topic Expert
Product and Topic Expert
0 Kudos

In case you run into an error that contains "incorrect syntax near "FROM"", check if you entered the right schema and table name as parameter for the procedure call. If the procedure can't find the table that you created, the metadata fetch fails and the resulting INSERT query will be incorrect.

former_member66033
Discoverer
Thanks Sefan,

 

your answer was very helpful. The procedure itself did run using the correct schema - but the tables i've created beforehand didn't.

I fixed that in my database, now it's running perfectly!
0 Kudos
Hi Sefan,

I ran into the error below while inserting 10,000,000 rows with this procedure:

Could not execute 'CALL "GenerateData"( IP_SCHEMA_NAME => 'SAPABAP1', IP_TABLE_NAME => 'ACDOCA', IP_INSERT_NUM_RECORDS ...' in 6:53.173 minutes .
[314]: numeric overflow: "SYSTEM"."GenerateData": line 153 col 4 (at pos 7947): Failed in "PROZS_PN" column with the value 1000.00

Column PROZS_PN is defined as follows:

"PROZS_PN" DECIMAL(5,2) CS_FIXED DEFAULT 0 NOT NULL


Inserting 100,000 rows didn't raise this issue, but failed with 500,000 and 1,000,000 rows each. Any idea?

BR,
Jinsol
0 Kudos
Plus, I found an improvement point. If I execute this procedure with a certain schema set on client session, it fails with an error below:

SET SCHEMA SAPABAP1;

CALL "GenerateData" (
IP_SCHEMA_NAME => 'SAPABAP1',
IP_TABLE_NAME => 'ACDOCA',
IP_INSERT_NUM_RECORDS => 200000,
IP_MAX_STRING_SIZE => 2,
IP_DELETE_NUM_RECORDS => 0,
IP_UPDATE_NUM_RECORDS => 0
);

Could not execute 'CALL "GenerateData" ( IP_SCHEMA_NAME => 'SAPABAP1', IP_TABLE_NAME => 'ACDOCA', ...' in 8 ms 349 µs .
SAP DBTech JDBC: [288]: cannot use duplicate table name: "SYSTEM"."GenerateData": line 71 col 3 (at pos 3498): #SURROGATE: line 1 col 37 (at pos 36)

FYI, my testing was done on HANA 2 Rev 64.

BR,
Jinsol
zili_zhou
Advisor
Advisor
0 Kudos
Thanks, Sefan! I tested this in HANA Cloud and it also works well.  By the way, do we have a way now to control the distinct the records for each column so that I could better simulate the customer's business data compression rate in HANA?  Like cost center, versions have normally very few distinct value even we have X millions to X00 millions of records in HANA.