Data and Analytics Discussions
Get involved in Data & Analytics discussions. Engage in vibrant conversations, share insights, and explore perspectives on data and analytics topics.

Data Load Integrity Control

XaviPolo
Active Contributor

Hello,

I have a question about what happens when an error occurs during a data load in SAP Datasphere.

If we use a data flow to perform a load to a DSP Local Table, do all the processed data operate within a database transaction, and if it fails, is there a rollback? Or is each batch treated as an individual transaction? I don’t see any option or flag to control this (as is the case in SAP Data Services), and I’m not sure about the default behavior. 

Regards,

9 REPLIES

TuncayKaraca
Active Contributor

Hi @XaviPolo ,

I see there is no reply yet! It's a very good question! Maybe you could do an experiment and analyze the results: intentionally give it some bad data to fail a batch, etc., then see what happens. 🙂

Looking forward to seeing more replies to your question as we try to make the Data and Analytics group more active! 😉

Regards,
Tuncay

AAshad
Discoverer

Hi XaviPolo,

Excellent question.

I haven't been able to find documentation on the topic.

Since all activity involving loading data into database tables falls under two-phase commit and is timestamped, I would assume everything has to succeed or a full or point-in-time rollback occurs. (But you know what they say when you assume.)

 

XaviPolo
Active Contributor

@TuncayKaraca  ok, I'll do the tests 🙂

TL;DR: commits are done for each block defined by the fetch size.

 

Test definition

The test consists of loading a CSV with 30,000 rows via a Data Flow and provoking an error in the middle of the process, to see what ends up in the destination table.

The CSV has 30,000 rows with two fields (a sketch for generating it follows the list):

  • ID: consecutive numeric
  • VAL: string with the value '1' in all rows except row 20,101, which has 'AAA'.
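
For reference, here is a minimal sketch of how such a test file could be generated. It assumes a plain CSV with a header row; the file names and the helper function are illustrative, not part of any SAP tooling.

```python
import csv

# Illustrative helper (not from SAP tooling): builds a test CSV with a bad
# value in one row to trigger "Fail Run on String Truncate" on a String(1) column.
def write_test_csv(path, total_rows, bad_row):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["ID", "VAL"])               # header row (assumption)
        for i in range(1, total_rows + 1):
            writer.writerow([i, "AAA" if i == bad_row else "1"])

write_test_csv("test_load_1.csv", 30000, 20101)      # first test
write_test_csv("test_load_2.csv", 110000, 105001)    # second test, described below
```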

The source is SFTP, configuring the CSV as:

  • ID as Int32
  • VAL as String(1)
  • Control Fetch Size enabled, set to 1,000 rows
  • Fail Run on String Truncate = On. This will cause row 20,101 to fail for exceeding the defined size of String(1).

The destination is a DSP local table with:

  • ID: Int32
  • VAL: String(1)
  • Control Fetch Size enabled, set to 1,000 rows

 

Test Results

The execution of the data flow ends with an error.

In the destination table, data has been saved: specifically, IDs 1 to 19,000.
So I draw two conclusions:
- The most important one is that the DF execution is not “all or nothing”: commits were made during the process.
- The second is that, since we have two fetch controls, one in the source and one in the destination, the block of rows 19,001-20,000 was not loaded even though it contained no error. Most likely, when the block of 1,000 rows containing the error was read, the exception was thrown before the part that writes to the destination had processed the 19,001-20,000 block, which is why we do not have it.

I did a second test, increasing the rows to 110,000, placing the bad row at 105,001, and removing the fetch size control on both sides, to check whether the default is the theoretical maximum fetch size of 100,000.
But it saved rows 1 to 104,000, implying that the fetch size used was not 100,000. I don't know whether the value it takes depends on the source, the combination of source and target, etc.
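
A quick way to reason about which rows should survive in the target, under my assumption (not documented behavior) that commits happen per fetch block and that the writer lags one block behind the block in which the error is detected:

```python
# Rough model of the observed behavior. Assumptions: commit per fetch block,
# writer one block behind the block in which the error is detected.
def last_committed_row(error_row, batch_size, writer_lag_blocks=1):
    error_block = (error_row - 1) // batch_size + 1   # 1-based block containing the bad row
    committed_blocks = max(error_block - 1 - writer_lag_blocks, 0)
    return committed_blocks * batch_size

print(last_committed_row(20101, 1000))    # 19000  -> matches the first test (fetch size 1,000)
print(last_committed_row(105001, 1000))   # 104000 -> matches the second test if the batch is 1,000
```

If this model is right, the second test result is consistent with a default batch size of 1,000 rather than 100,000.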

 

Which implies

When doing ETLs with delta loads that are managed manually, e.g. based on timestamps, there is a risk that rows from a failed load are left in the target table when an error occurs.
This can make the calculation of the next timestamp to load erroneous, so these rows should be deleted before that calculation (see the sketch below).
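
One way to hedge against this is to clean up the leftovers of a failed run before computing the next watermark. A sketch with the SAP HANA Python client, assuming the target table is reachable through a database user / Open SQL schema and that each run tags its rows with a LOAD_ID column; all names here are hypothetical, none come from the thread.

```python
from hdbcli import dbapi  # SAP HANA Python client

# Illustrative values: host, schema, table, LOAD_ID / CHANGE_TS columns and the
# idea of tagging each run with a load id are all assumptions for this sketch.
failed_load_id = "<id of the failed run>"

conn = dbapi.connect(address="<host>", port=443, user="<user>",
                     password="<password>", encrypt=True)
conn.setautocommit(False)
cur = conn.cursor()
try:
    # Remove leftovers of the failed run before recomputing the watermark.
    cur.execute('DELETE FROM "MY_SCHEMA"."TARGET_TABLE" WHERE "LOAD_ID" = ?',
                (failed_load_id,))
    # Only then derive the next delta timestamp from data that is really complete.
    cur.execute('SELECT MAX("CHANGE_TS") FROM "MY_SCHEMA"."TARGET_TABLE"')
    next_watermark = cur.fetchone()[0]
    conn.commit()
except dbapi.Error:
    conn.rollback()
    raise
finally:
    cur.close()
    conn.close()
```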

Regards,

@XaviPolo What a great experiment! 😉 And what great scientific proof of how DFs work with Control Fetch Size in error situations. ✔️ Very valuable findings!

It's interesting to see that, unexpectedly, even the previous block (in your example, 19,001-20,000) was not loaded. This proves the value of the experiment! I think you've explained the reason very well.

There still seems to be a question about the theoretical maximum limit of the fetch size being 100,000. Where has it been mentioned that the maximum limit of fetch size is 100,000? Is it an official limit?

The risk associated with delta loads is so critical. That's why monitoring and managing ETLs is so important.

Thank you

The form has a validation for "Batch Size" that shows the message "Enter a number between 0 and 100000." if the criteria are not met.


Hello XaviPolo,

Just a follow-up on data loads into Datasphere with data flows.

The ETL processes for loading data into Datasphere should be based on the SAP HANA data transfer model, which includes the persistence layer, responsible for the durability and atomicity of transactions, and the transaction manager, responsible for coordinating all database transactions and keeping track of all running and closed transactions. When a transaction is executed or fails, the transaction manager notifies the relevant data engine to take the necessary actions.
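
To illustrate what that atomicity means at the SQL level, here is a minimal sketch with the SAP HANA Python client of a single explicit transaction: either all rows become visible or none do. Connection details and table names are placeholders, and this is not what the data flow does internally, just the concept.

```python
from hdbcli import dbapi  # SAP HANA Python client

rows = [(1, "1"), (2, "1"), (3, "1")]  # example payload

conn = dbapi.connect(address="<host>", port=443, user="<user>",
                     password="<password>", encrypt=True)
conn.setautocommit(False)              # make the whole load one explicit transaction
cur = conn.cursor()
try:
    cur.executemany(
        'INSERT INTO "MY_SCHEMA"."TARGET_TABLE" ("ID", "VAL") VALUES (?, ?)', rows)
    conn.commit()                      # all rows are persisted together...
except dbapi.Error:
    conn.rollback()                    # ...or none of them are
    raise
finally:
    cur.close()
    conn.close()
```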

Your thoughts?

 

XaviPolo
Active Contributor

I think the question is what the system considers a transaction: the whole process or each individual batch.
From the tests, it is clear to me that the whole process is not managed as a single DB transaction.
Actually this is fine for some cases, since keeping a load with a lot of data in a single transaction consumes a lot of resources, but for smaller loads it can be important to use a single transaction to facilitate data integrity.
And the problem gets more complicated because in DSP we don't have a failure path, for example to do cleanup after a failure.

Maybe a HANA-to-HANA data load behaves differently, but I am more concerned about loading data from external systems, where there is a greater chance of failure (see the sketch below contrasting the two commit strategies).
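
For comparison with a single final commit, this is roughly the per-batch commit pattern that the tests suggest the data flow uses: cheaper on resources for large loads, but a failure leaves the earlier batches committed in the target. Again, a sketch with placeholder names, not the actual implementation.

```python
from hdbcli import dbapi  # SAP HANA Python client; all names below are placeholders


def load_in_batches(conn, rows, batch_size=1000):
    """Commit after every batch: memory-friendly for large loads, but a failure
    in batch N leaves batches 1..N-1 committed in the target table."""
    conn.setautocommit(False)
    cur = conn.cursor()
    try:
        for start in range(0, len(rows), batch_size):
            batch = rows[start:start + batch_size]
            cur.executemany(
                'INSERT INTO "MY_SCHEMA"."TARGET_TABLE" ("ID", "VAL") VALUES (?, ?)',
                batch)
            conn.commit()   # partial results survive if a later batch fails
    finally:
        cur.close()
```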

Regards,


I got you. Thanks! So with the fetch size control removed on both sides, it saved rows 1 to 104,000 even though the error was in row 105,001. Can we say there may be a default fetch size of 1,000?

I would say yes.