Technology Blogs by Members
Explore a vibrant mix of technical expertise, industry insights, and tech buzz in member blogs covering SAP products, technology, and events. Get in the mix!
cancel
Showing results for 
Search instead for 
Did you mean: 
christian_willi
Explorer

Introduction

When using SAP Datasphere to transform data for persistence, the Data Flow provides the necessary functionality. We recently compared various basic transformation tasks using different modeling approaches. Therefore, we tried four different approaches to implement a certain logic:

  1. Modelling with the Standard Operators in the Data Flow
  2. Modelling with a Graphical View as a source to be consumed in the Data Flow
  3. Modelling with a SQL View as a source to be consumed in the Data Flow
  4. Modelling with the Script Operator in the Data Flow.

The goal was to give a recommendation about what approach might be the best for various scenarios in case of runtime, maintenance, other categories and if every scenario can even be modelled with every approach. We implemented the following scenarios:

  • String to Date Conversion
  • Join Data
  • Concatenate Columns
  • Aggregate Data
  • Transpose Data and Aggregate
  • Regex
  • Unnesting Data
  • Generate a Hash
  • Generate a Rank Column
  • Calculate a moving Average

Setup

To have a comparable setup, we performed this action with an identical dataset, which contains the following columns:

  • Region
  • Country
  • Item Type
  • Sales Channel
  • Order Priority
  • Order Date
  • Order ID
  • Ship Date
  • Unit Sold
  • Unit Price

We uploaded this dataset (a CSV file) into a table. The table then contained 10 million records. The reason for that is that we wanted to get a feeling how Data Flows and Datasphere handles big amounts of data.

Results and Interpretation

The outcome of our tests is now displayed in the table below. Note that the runtimes are displayed in MM:SS format, with seconds rounded to minutes if the runtime exceeds a few minutes.

Scenario

Python (Script Operator)

Standard Operator

Graphical View

SQL View

String to Date

45:00

00:45

00:58

00:49

Join

 NA

01:00

00:53

00:50

Concatenate

36:00

00:52

00:51

00:36

Aggregation

23:00

00:39

00:25

00:37

Transpose and Aggregation

24:00

00:50

00:28

00:24

Regex

36:00

00:59

01:00

00:50

Unnesting Data

14:00

NA

NA

00:38

Hash

234:00

NA

01:00

01:00

Rank

40:00

NA

00:58

01:00

Moving Averages

23:00

NA

NA

00:21

For better comparison, the chart below provides an overview in logarithmic scale.

1_execution_times_plot_log.png

One of the first findings is that between the Standard Operator, the Graphical View and the SQL View there is not a huge difference. Given the amount of data, the performance is overall quite pleasant.

Additionally, some requirements or tasks are not feasible with the Standard Operator or the Graphical View, but an SQL View supports a wide range of possibilities.

The elephant in the room is obviously the performance of the Script Operator. The one thing which should enhance your possibilities as a developer with a currently very popular programming language does not perform in any acceptable way compared to the other options. After we did our tests, we contacted SAP support to verify one of our scenarios. We thought we missed something in our modelling approach or probably this is even a bug. Maybe we missed the “Make it fast” setting. But after we posted our incident, we got some insight from SAP Support why this is slow. Spoiler alert: We did not miss the “Make it fast” setting. The explanation for this is quite simple. When you use the Standard Operators (without the Script Operator), the Graphical View or the SQL View everything can be performed directly on the database. However, when you use the Script Operator all the data which is processed in the Script Operator needs to be transferred to a separate SAP DI cluster which will perform the Python operation and afterwards the result needs to be transferred back. In our case that is 10 million records which is almost about 1GB of data. We tried to illustrate the process based on the feedback from SAP in the picture below on a high level.

2_data_flow_matrix.png

Also, the recommendation by the support was that the Script Operator should only be used if the requirement cannot be implemented with one of the other options. However, we think that how the Script Operator is advertised by SAP this can be an unpleasant surprise. Currently we see the Script Operator to be used very carefully, because in the end it might be a bottle neck in processing data during a transformation. Now one could argue that 10 million records is not something which is transferred on a regular basis in data warehouses, but we think this statement is not correct. In current SAP BW Warehouses, we regularly see the amount of data which is growing. Transferring at least 1 million records daily is not uncommon. Initially we were very excited to used Python, but currently we would generally advise against its use unless absolutely necessary. Even then, be prepared for potential performance issues during the runtime of your Data Flows.

Conclusion

To reiterate, the primary takeaway is the recommendation to avoid using the Script Operator in a Dataflow. Due to our test and the incident we submitted to SAP, we gained insights into how the data is processed in the background. We also searched to find if SAP provides this information already somewhere within the Datasphere documentation but could not find it. This might be helpful to gain a better understanding. It might be slightly misleading how the Script Operator is advertised. It's important to be aware of its limitations, making SQL the preferred option for now.

11 Comments
TuncayKaraca
Active Contributor

Hi @christian_willi,

I've happily given kudos 👍 immediately for such a great experiment and blog post. Well done! ✔️ Thanks.

With your analysis and findings Python Script Operator in Datasphere reminds me that like ABAP Engine vs HANA database (Code Pushdown). Yes, indeed Python Script Operator frankly does not use HANA database and that's as you mentioned a bottleneck and we should avoid using it.

Also, it's interesting to find out there is a separate SAP DI cluster which performs the Python operations. 😯

On the other hand Graphical Views are handy as long as their capabilities are satisfactory but also we should note that SQL Views are very powerful with support a subset of the SQL syntax supported by SAP HANA Cloud.

Regards,
Tuncay

tmeyer
Explorer

Thanks for the insights.

Venky999
Explorer

Thank you Chrisitian for sharing this info. 

Marc-An_BERTHET
Explorer

Wonderful blog thank you !

GerritPosthumus
Participant

The DI cluster are kubernetes based and it need to spin-up a docker container to run the code so this also add around 2 minutes of execution time.

XaviPolo
Active Contributor

Great post.

One more thing about Python Operator, that must be considered in some scenarios: the python program is executed per each block of rows (block size is specified in Batch Size, but max is 100.000 rows per block) , so it is not possible to perform calculations that need to process all data together.

TuncayKaraca
Active Contributor
0 Kudos

@GerritPosthumus Is there a reference documentation for the information you gave? 

TuncayKaraca
Active Contributor
0 Kudos

@XaviPolo Could you share reference information for "Python Operator -python program is executed per each block of rows (block size is specified in Batch Size, but max is 100.000 rows per block)"? Is it a  Datasphere or Python thing?

XaviPolo
Active Contributor

@TuncayKaraca it's a Data Flow thing. Python has no limit (more than addressable memory).

This is something I was able to verify myself since the first version of the DFs.

You can test it using the following code: 

def transform(data):
    data['row_num'] = np.arange(data.shape[0])
    return data

This is to add a "row number" column (starts with 0), if you have a table with 1000 rows and you set 1000 rows as Batch Size for DF you will get 0-999 in "row_num" column. If you have a table with 2000 rows (and use 1000 as batch size) you will see that each value of num_row has 2 rows, because data was splitted into 2 blocks of 1000 rows and executed independently in python.

P.S.: about what @GerritPosthumus  comments, it is because the Dataflows part is an integration of SAP Data Intelligence processes in DSP. That is why they have a high overhead, they have to start and move data between systems. DI is kubernetes based.

Regards,

christian_willi
Explorer

@TuncayKaraca the behaviour, that the data is split into several batches is also described in this blog https://community.sap.com/t5/technology-blogs-by-sap/how-to-use-the-script-operator-in-sap-data-ware...

TuncayKaraca
Active Contributor
0 Kudos

@XaviPolo Thanks for the information. It's good to see a proof directly in Datasphere. 

@christian_willi Thanks for reference blog post for the batches scenario. I've read it all.

Question to @christian_willi@XaviPolo and everybody: So could we say that we should avoid using Python Script Operator for like transformation scenarios (conversion from Datetime to String, Extraction of Substrings) because of performance reasons but there is a use case for Python Script Operator like Task 1 - Classify numerical data as bins in @P_Plazi blog post How to use the Script Operator in SAP Data Warehouse Cloud for Data Manipulations 

Labels in this area