Sap Datasphere Data Flow Series – Script Operator...

Kunal_Mohanty · ‎11-25-2023

Hey Champs,

Hope you enjoyed the last blog on Sap Datasphere Data Flow Series – Operators (Joins, Projection, Aggregation, Union), well its just the trailer. Today lets jump into script operator of Datasphere in details.

The Script Operator seamlessly integrates the functionalities of popular Python libraries Pandas and NumPy with SAP Data Warehouse Cloud, enabling the creation of Data Flows and tailored information views. This versatile operator caters to a diverse range of tasks, including data cleansing, data transformation, and more.

Syntax for script operator:

-----------------------------------------------------------------------

def transformation(data):

# Fill in your scripts here with data as a source

# and save the values into data as output

return data

------------------------------------------------------------------------

Currently included libraries are restricted to Pandas, NumPy and several built-in module operations. But still not all functions are supported we can check the official documentation for the supported things.

Sap Official Documentation

Limitations of Sap Datasphere Script Operator:

We have a dedicated python editor where we can write our python code. But the only limitation is that we don’t have a syntax validator. And that’s where it hurts a lot, hope will get in future soon.

Due to the maximum batch size limit of 100,000 entries, the given Pandas function cannot be used to effectively remove duplicates. Using SQL-Script is the recommended approach for this task to ensure accurate duplicate detection, even across batches.

Limitation due to batch size

Importance of DataSphere script operator in real world:

The Script Operator in SAP DataSphere offers several valuable use cases and is crucial for real-world data manipulation and analysis tasks. Here are some key usage and advantage of the Script Operator:

Data Cleaning and Preprocessing: The Script Operator works very well in cleaning and preparing raw data for analysis. It helps the users to handle tasks like removing invalid characters, imputing missing values, standardizing data formats, and detecting outliers. This ensures data quality and reliability for analytical processes.

Complex Data Transformations: The Script Operator extends the capabilities of SAP DataSphere by enabling complex data transformations which are not feasible with built-in operators. It provides user the flexibility to perform advanced data manipulation techniques like feature extraction from text data, data enrichment through external sources, and custom aggregation functions.

Integration with Python Libraries: The Script Operator seamlessly integrates with popular Python libraries like Pandas, NumPy, and built-in modules. This allows users to leverage the extensive functionalities of these libraries for data manipulation, analysis, and exploration.

Handling Unstructured Data: The Script Operator effectively handles unstructured data formats like text, JSON, and XML. Users can write Python code to parse, extract, and transform unstructured data, making it usable for downstream analysis.

Custom Data Validation and Checks: The Script Operator facilitates custom data validation and checks. Users can write Python code to implement custom business rules, data quality checks, and anomaly detection mechanisms, ensuring data integrity and consistency.

Note : We will try to learn from use case about how we can use the python operators and get familiar with the syntax. Not all use cases will be useful. Try to get the use of the operators and how to write the code

Use Case 1:

We will always get a scenario where we want to count the number of employee for each department or number of items in a order. So in this case we will try to take a similar example where I will use script operator to find the count of different payment mode. Using this we will see how efficient is script operator and what flexibility it gives us while writing python code.

Let’s have a look at our data:

This is our order table data and using this I want to count the payment modes.

Order Table Sample Data

So we will expect some output like this.

Expected Output Script Operator

Let’s have a look at the code and understand it step by step:

Python Script to group Data

This code defines a function called transform that takes a Pandas DataFrame as input and returns a new DataFrame with two columns: 'Payment Mode' and 'Count of Payment mode'. The function performs the following steps:

Convert 'Payment Mode' column to categorical data type: The Payment Mode column is converted to a categorical data type using the pd.Categorical() function. This is useful for tasks like data analysis and machine learning, as it allows for more efficient data manipulation and representation.

Group the data by the 'Payment Mode' column: The data is grouped by the 'Payment Mode' column using the DataFrame.groupby() method. This creates a grouping object that allows us to iterate over the data grouped by the payment mode.

Create a new DataFrame to store the desired output: An empty DataFrame is created with two columns: 'Payment Mode' and 'Count of Payment mode'. This DataFrame will store the final output of the function.

Now lets get our hand dirty by entering to datasphere and using the dataflow.

Add a table as source into the play area.

Add a projection and keep only payment mode and remove other columns.

Add a script operator and write the above script and connect everything.

Add a target table and connect the script operator to the target table. Now save and deploy it

Dataflow Script Operator

Conclusion :

This blog introduced the script operator available in datasphere data flow that we’ll use along the entire blog series, I have explained each and every part of the script operator in detail. Script allow is a very handy thing to do a lot of manipulation and analysis of data. The things which we can do using sql script, the same thing also we can do using script operator but since pyhon is used as a language so its gives us the flexibility to use lots of inbuilt function to do a more operation on data. I will continue the second part of this script operator to show what things are possible easily using script operator which we can't do using sql script.

Thanks for reading! I hope you find this post helpful. For any questions or feedback just leave a comment below this post. Feel free to also check out the other blog posts in the series and follow me to learn as well as master sap analytics. Let me know if you find something can be improved or added.

Best wishes,

Kunal Mohanty

MKreitlein · ‎11-27-2023

Hello Kunal,

thanks a lot for that interesting blog. I already started reading the linked SAP help page, but already on the first example, I have some doubts:

FYI I'm coming with BW background, where I only use start- or endroutines in Transformations - no expert routines.

In a Startroutine you always have only the fields of the source available and in the Endroutine only the fields of the target.

In example one it describes (at least in German) that the script expects exactly 5 fields in the source and exactly 3 fields in the target.

--> But when and how do I know if I work with my source structure or my target structure?

I mean, if the names would differ completely... where and when takes the "switch" place?

Maybe I'm thinking too complicated... but the fact that you don't have a syntax check nor a real explanation it is quite difficult for newbies (e.g. you start with pd.Categorical.... what is "pd."?)

Thanks for any feedback,

Martin

Jannis94 · ‎11-27-2023

Hey Martin,

it is not as complicated as it looks:

the object "data" always contains the source structure

the target structure has to be built in your coding and whatever you return in your return-statement has to match the target structure (i would assume names & order of columns)

pd is referring to the pandas library, so pd.XYZ is calling the function XYZ from that library.

BR

Kunal_Mohanty · ‎11-27-2023

def transform(data):

df = data[['first','last']]

df[full] =  df['first'] + '.' + df['last']

return df

--> But when and how do I know if I work with my source structure or my target structure?
Completely depends upon the requirement. "Data" contain 5 fields and "DF" contain 3 fields
we created a data frame "df" and kept 3 fields , two are existing fields and one is a new fields we added. Everything can be managed through the code, but name is source and target will be same if we are not changing it explicitly in the code.
Target structure, once u write the the script then automatically after that the script is ready based on that you have to create the target table like that will be automatically created.

"PD" refers to pandas library in python.
I can explain about script operator more in detail but certain things are not possible to write.
Best way to answer your query is.,you can give some of use case where you are facing issue then i can help with that.

Kunal_Mohanty · ‎11-27-2023

Thanks for replying Jannis

sidraz99 · ‎12-15-2023

Hello Kunal, Amazing blog. However as we know this can be achieved by SQL script as well. Just wanted to understand do we have any specific use cases or functions which cannot be achieved using SQL or can be more effective if we write python script for them.

GayathriR · ‎02-12-2024

Hello Kunal,

I need to create new Records and appended to my Target Table (Old + New).

My Record has Account ( Revenue, Discount ) & Amount has values for it.

The Newly Created Record has to be

Account = 'Net Revenue' with Amount = Amount. Revenue - Amount. Discount

Can this be done in a DF script.

Sap Datasphere Data Flow Series – Script Operator Part 1

SAP PI for Beginners

ABAP 7.40 Quick Reference

Fiori: technical installation and configuration of one app from A - Z