
Introduction


This is the fourth of five blog posts in the “SAP BTP Data & Analytics Showcase” series. I recommend looking into our overall blog post to gain a better understanding of the end-to-end scenario, which involves multiple SAP HANA Database & Analytics solutions in the cloud.

Nowadays, we experience and talk a lot about artificial intelligence, machine learning, predictive analytics, and data science, both in our daily lives and in business. Experts like data scientists play an essential role in implementing value-added data and analytics use cases for firms. However, data scientists still face pain points that hinder their ability to work effectively and derive business value from data in real-world environments. You can look into the blog post from a top SAP data scientist to gain more insights from her perspective.

What motivated me to create this blog post is to demonstrate how our unified data and analytics solutions from SAP HANA Database & Analytics address a few of these pain points and flexibly assist data scientists in their daily work, in an end-to-end way.

In the upcoming sections of this blog post, I'd like to highlight how data scientists can choose the solutions they want, combine capabilities from our portfolio, and solve a specific problem end-to-end:

  • Data Preparation - To close the gap between data scientists and IT, we recommend involving data scientists early in the data integration & preparation phase. SAP Data Warehouse Cloud offers one single place to combine all the necessary data from different sources across the company's real IT landscape.

  • Model Creation - Open source software (e.g., Python or R) is an important pillar of data science projects. Data scientists can continue using Python scripts in a Jupyter Notebook, where a connection to the artefacts of SAP Data Warehouse Cloud has been established (see the connection sketch after this list), and train ML models.

  • Model Inference - The trained ML models are reused to make predictions. The prediction results are easily written back to SAP Data Warehouse Cloud (via the so-called Open SQL Schema) and can later be consumed to enhance your data models.

  • Result Visualization - As a last step, to present the findings and patterns behind the data in a consumable way, e.g., to business users, a dashboard combining the predictive results is created or enhanced via the self-service tools in SAP Analytics Cloud.
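
To make the Model Creation and Model Inference steps more tangible, here is a minimal sketch of how such a Jupyter Notebook connection could look, using the hana_ml Python package. It is an illustration only: the host, credentials, schema, and view names are placeholders, not the actual objects of this scenario.

```python
# pip install hana-ml
from hana_ml.dataframe import ConnectionContext

# Connect to the Open SQL schema of SAP Data Warehouse Cloud
# (all connection details below are placeholders for your own tenant).
conn = ConnectionContext(
    address='<your-tenant>.hanacloud.ondemand.com',
    port=443,
    user='<OPEN_SQL_SCHEMA_USER>',
    password='<PASSWORD>',
    encrypt=True
)

# Expose a harmonised view of SAP Data Warehouse Cloud as a hana_ml
# DataFrame; queries are delegated to SAP HANA Cloud, so the mass data
# stays in the database rather than being pulled into the notebook.
prices = conn.table('<HARMONISED_PRICES_VIEW>', schema='<SPACE_SCHEMA>')
print(prices.head(5).collect())  # only this small sample is fetched locally
```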



Figure 1: End-to-End Data Science Scenario


Most of you have heard of the 80/20 dilemma: roughly speaking, data scientists spend about 80% of their time generating and preparing data and only 20% of their time building and training models! We understand that what matters most is NOT the algorithms alone. Therefore, we intend to demonstrate how our unified solutions can assist data scientists in the data preparation phase, with real-world data, and support developing a data science project in an end-to-end way.


Figure 2: Data Preparation in SAP Data Warehouse Cloud



Figure 3: Model Creation & Inference via Python in Jupyter Notebook



Figure 4: Data & Result Visualization in SAP Analytics Cloud



Key Scenario


This data science scenario is implemented based on our well-known "Tankerkönig" data model. We utilise historical gasoline price data (2017-2021) for Germany from this public website, where the price data is stored in CSV files. We use the data from this website only for blogging and demonstration purposes.

As you may be aware, gasoline prices in Germany have increased dramatically this year compared to the same periods of previous years. The Covid-19 pandemic has changed the way business operates today. In our scenario, we intend to conduct time series forecasting of gasoline prices in Germany, taking the factor "Covid-19 number of cases" into consideration. We think this could help firms or individuals better plan vehicle refuelling, depending on the predicted gasoline price trends. We obtain the information about Covid-19 cases from this public repository maintained by the RKI institute.

To solve the above-mentioned problem, the following three sub-scenarios are defined and implemented.


Figure 5: High-level solution map and implementation steps



Part 1: Data Preparation



  • Utilise the SQL-on-Files capability of SAP HANA Cloud to query the files containing the last five years of historical prices (see the sketch after this list)

  • Integrate data from different sources into SAP Data Warehouse Cloud, using the Federation or Replication capability

  • Build harmonised views combining all the necessary data via the Self-Service Modelling features in SAP Data Warehouse Cloud
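
As an illustration of the first bullet above, the sketch below queries such files from Python. It assumes a virtual table (hypothetically named GAS_PRICES_FILES here) has already been created in SAP HANA Cloud on top of the CSV files in the data lake, as described in the blog post by Jacek referenced in the Implementation section; all table and column names are illustrative only.

```python
from hana_ml.dataframe import ConnectionContext

conn = ConnectionContext(address='<hana-cloud-host>', port=443,
                         user='<USER>', password='<PASSWORD>', encrypt=True)

# With SQL-on-Files, the CSV files appear as a queryable table, so plain
# SQL works on them. 'GAS_PRICES_FILES' is a hypothetical name for such
# a table; the real column names depend on the file structure.
df = conn.sql("""
    SELECT "date", "station_uuid", "e5"
    FROM "GAS_PRICES_FILES"
    WHERE "date" >= '2017-01-01'
""")
print(df.head(10).collect())  # fetch only a small sample into pandas
```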


Part 2: Model Creation & Inference



  • Enable the SAP HANA Cloud Script Server for machine learning and gain read access to all views of SAP Data Warehouse Cloud via Python scripts

  • Consume algorithms from the SAP HANA Python Client API for Machine Learning (APL library) and create ML models for multiple time series forecasting (see the sketch after this list)

  • Run the trained ML model to predict gasoline prices for the next 7 days

  • Write the prediction results directly back to SAP Data Warehouse Cloud via the so-called Open SQL Schema and enhance the data models
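
A minimal sketch of these steps could look as follows, using the AutoTimeSeries algorithm from the APL library of the hana_ml package. It trains on a single price series and writes the forecast back to a table in the Open SQL schema; all view, column, and table names are placeholders, and in the real scenario one model is created per series (e.g., per fuel type).

```python
from hana_ml.dataframe import ConnectionContext
from hana_ml.algorithms.apl.time_series import AutoTimeSeries

conn = ConnectionContext(address='<host>', port=443,
                         user='<USER>', password='<PASSWORD>', encrypt=True)

# Daily gasoline prices plus Covid-19 case numbers as an extra predictor
# ('PRICES_WITH_COVID_VIEW' and its columns are placeholder names).
series_in = conn.table('PRICES_WITH_COVID_VIEW', schema='<SPACE_SCHEMA>')

# Train a time series model and forecast the next 7 days.
model = AutoTimeSeries(time_column_name='DATE',
                       target='E5_PRICE',
                       horizon=7)
model.fit(data=series_in)
forecast = model.forecast(forecast_length=7)

# Write the prediction back into the Open SQL schema, where it can be
# consumed by SAP Data Warehouse Cloud and SAP Analytics Cloud.
forecast.save(('<OPEN_SQL_SCHEMA>', 'GAS_PRICE_FORECAST'), force=True)
```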


Part 3: Result Visualization



  • Create a dashboard in SAP Analytics Cloud to show the historical gasoline price changes (2017-2021)

  • Analyse the correlation between gasoline prices and Covid-19 case numbers by leveraging the Business Intelligence capabilities of SAP Analytics Cloud

  • Enhance the dashboard with the prediction results from the ML models


Meanwhile, I'd like to share the following use cases of SAP Data Warehouse Cloud with you, hoping they give you some insights and ideas on how to leverage these capabilities in your own data science scenarios.


Figure 6: Use Cases of SAP Data Warehouse Cloud



Implementation


We have prepared the following sub-blog posts, which explain the implementation in more detail.

In the first blog post, we skipped the SQL-on-Files part and started directly with data preparation in SAP Data Warehouse Cloud. If you are interested in how to query the 5 years of historical data, I recommend this blog post from my colleague Jacek, who describes the data management concept and SQL-on-Files very well.

At the end of both blog posts, the visualisation part is shown via stories in SAP Analytics Cloud, as we have established the data models (views) by consuming the prepared data and predictions in SAP Data Warehouse Cloud. However, the step-by-step guidelines for the implementation in SAP Analytics Cloud are skipped, as we aim to share the end-to-end approach with you.

Conclusion


We hope this blog post has given you a comprehensive overview of how the unified data and analytics solutions from the SAP HANA Database & Analytics area can flexibly facilitate data scientists' daily work in an end-to-end way. Thank you for your time, and please stay tuned and curious about our upcoming blog posts!

At the very end, I would like to say thank you to my colleagues Yann Le Biannic, Antoine Chabert, Andreas Forster, Axel Meier and Jacek Klatt, who shared their expertise and helped make this end-to-end demo story happen!

We highly appreciate all your feedback and comments! In case you have any questions, please do not hesitate to ask in the Q&A area as well.

3 Comments
pbaumann
Active Contributor
Hi Wei Han!

Thank you for your wonderful blogs showing how we can work today bringing components like DWC, HANA Cloud and Python together!

One point is not clear to me, maybe you can help. For your point:

Part 1: Data Preparation



  • Utilise SQL-on-Files capability of SAP HANA Cloud to query files of the recent 5 years historical prices


you refer to the SAP HANA Cloud, data lake feature to put files like CSV, Parquet or ORC into the data lake. In Part 1, Step 2 of your next blog https://blogs.sap.com/2021/12/06/sap-btp-data-analytics-showcase-data-preparation-for-a-data-science... you fetch data twice from HANA Cloud. To my understanding this is just a "normal" HANA Cloud table. In Step 3 you fetch a CSV file, but this is imported directly (from your desktop or somewhere). So I haven't really seen the SQL-on-Files capability. Am I wrong?

 

Thank you for further explanations!

 

Best regards,

Peter

Hi Peter,

Thank you for reaching out.

We applied the SQL-on-Files capability only to the 5 years of historical price data, as it contains a huge amount of data. The CSV file mentioned in Step 3 of the Data Preparation blog is the Covid-19 data, which was downloaded as a CSV file and uploaded directly into SAP Data Warehouse Cloud.

Yes, you're right. I skipped the implementation of the SQL-on-Files part in this blog series. As mentioned in the Implementation section, you can read the blog post "Data management concepts and techniques in SAP HANA Cloud" from my colleague Jacek for more details, as he helped me here.

Hope this answers your questions. 🙂

Best regards,

Wei

 

 

 

pbaumann
Active Contributor
Hi Wei!

Thank you for giving me a better understanding. The blog you mentioned from jacek.klatt2 is also really great and detailed!

Best regards,

Peter