What's New with HANA ML in SAP HANA Cloud 2024 Q1

ChristophMorgen

With the 2024 Q1 database release, several new features have been released for the SAP HANA Cloud Predictive Analysis Library (PAL). An enhancement summary is available in the What’s New document for SAP HANA Cloud database 2024.02 (QRC 1/2024).

The feature highlights for the current release are described in more detail below.

Classification and Regression enhancements

Unified Regression, along with Unified Classification and Time Series, now supports permutation feature importance, a widely used method in global explainability to evaluate the contribution of individual features to the overall predictive power of a model. This is achieved by measuring the decrease in a model’s performance when a feature’s values are randomly shuffled. A detailed explanation and examples are also given in the blog Global Explanation Capabilities in SAP HANA Machine Learning.

Classic feature importance vs permutation feature importance reports (see blog for details)
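
The shuffling idea described above can be sketched in a few lines of plain Python. This is only an illustration of the general technique, not PAL's implementation: the helper names (permutation_importance, toy_model), the use of mean squared error as the performance measure, and the toy data are all assumptions made for this example.

```python
import random


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            score = mean_squared_error(y, [predict(row) for row in X_perm])
            increases.append(score - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical toy data: y depends strongly on feature 0, weakly on
# feature 1, and not at all on feature 2.
data_rng = random.Random(42)
X = [[data_rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
y = [3.0 * x0 + 0.5 * x1 for x0, x1, _ in X]


def toy_model(row):
    # Stand-in for a fitted model; any callable mapping a row to a prediction works.
    return 3.0 * row[0] + 0.5 * row[1]


importances = permutation_importance(toy_model, X, y)
print(importances)
```

Because the measure only needs predictions, the same procedure applies to any model type, which is why it can be offered uniformly across regression, classification and time series functions.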

The Hybrid Gradient Boosting Tree (HGBT) now supports F1-score, recall and precision as cross-validation metrics for improved, more targeted classification models. Furthermore, weight scaling of target values in classification is now supported, either to address imbalanced classes or to weight target values according to, for example, the different costs associated with the individual class values.
A new regression model objective function, “reweighted square”, has been introduced, which helps to achieve more robust and regularized regression models.
For improved early stopping during model optimization, the validation metric used for early stopping can now be set explicitly.

The recently introduced multi-layer perceptron (MLP) recommender function now supports multiclass classification and regression recommender scenarios. This allows the recommendation task to be reformulated as a classification or regression problem. The implementation employs a dual-stream framework in which two sets of features, representing for example user and item features respectively, are fed into a feature selection module. The outputs are streamed into MLP neural networks and combined in a bilinear aggregation layer. This neural network framework can handle large-scale data volumes in recommendation scenarios very effectively.

The K-Nearest Neighbor (KNN) classification and regression functions have been enhanced with a new similarity search method. In addition to brute-force and KD-tree searching, a matrix-enabled search method has been introduced, allowing for much faster similarity search results, especially with high-dimensional numeric feature data.
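
To give an intuition for why a matrix formulation can speed up similarity search, the sketch below contrasts a row-by-row brute-force search with a version that computes all squared Euclidean distances in a single matrix product. The internals of PAL's matrix-enabled search are not described here, so this is only one common formulation, assumed for illustration; the function names and the tiny data set are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def knn_brute_force(X, q, k):
    """Indices of the k nearest rows of X to one query q, one distance at a time."""
    dists = [float(np.sum((x - q) ** 2)) for x in X]
    return np.argsort(dists, kind="stable")[:k]


def knn_matrix(X, Q, k):
    """Indices of the k nearest rows of X for every query row in Q at once."""
    # ||q - x||^2 = ||q||^2 + ||x||^2 - 2 q.x, evaluated for all pairs together
    q_norms = np.sum(Q ** 2, axis=1)[:, None]   # shape (n_queries, 1)
    x_norms = np.sum(X ** 2, axis=1)[None, :]   # shape (1, n_points)
    d2 = q_norms + x_norms - 2.0 * (Q @ X.T)    # squared Euclidean distances
    return np.argsort(d2, axis=1, kind="stable")[:, :k]


# Hypothetical toy data: four reference points and two query points.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
Q = np.array([[0.0, 0.0], [3.0, 2.0]])
neighbors = knn_matrix(X, Q, k=2)
print(neighbors)
```

The matrix version replaces many small per-row operations with one dense matrix multiplication, which optimized linear algebra routines handle efficiently; the benefit grows with the number of feature dimensions.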

Auto-ML and ML pipeline function improvements

The Auto-ML functions in the Predictive Analysis Library (PAL) have been enhanced with the following:

a new option to trigger deeper fine-tuning of the best pipeline found
a random search-based optimization that complements the genetic algorithm-based Auto-ML optimization; it is suited especially to smaller configurations (e.g. simple time series) and yields faster results
a new method to clear and initialize the Auto-ML log
a lightweight SHAP global surrogate model for Auto-ML and pipeline model explainability, giving faster global explanation calculation and faster local prediction interpretability results

Text Processing

The Text Mining-related document and term analysis functions now support massively parallel invocation, allowing multiple input texts to be analyzed in parallel.

Multiple documents (here IDs 0 and 5) are searched in parallel for related documents

New financial data analysis functions

The newly implemented single-factor Hull-White procedure can be used to model the time evolution of interest rates, which is required for the price estimation of financial instruments based on interest rate derivatives.

To apply the Hull-White model, it first needs to be adapted to match existing market conditions (interest rates). This is achieved by providing the values of the drift term of the Hull-White model as a time series in an input table. The simulation then provides the mean value over a given number of simulation paths (also specified as an input parameter), their variance, and the upper and lower bounds.

The chart above depicts the initial dataset used to calibrate the model, together with the mean and confidence interval of the Hull-White simulation.
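
As a rough illustration of what such a simulation does, the following sketch discretizes the single-factor Hull-White dynamics dr = (theta(t) - a*r) dt + sigma dW with a simple Euler-Maruyama scheme and summarizes the paths per time step. It is not the PAL procedure: the function names, all parameter values, the flat drift series and the choice of mean +/- 1.96 standard deviations as upper and lower bounds are assumptions made for this example.

```python
import math
import random
import statistics


def simulate_hull_white(theta, r0, a, sigma, dt, n_paths, seed=0):
    """Euler-Maruyama paths of dr = (theta(t) - a*r) dt + sigma dW."""
    rng = random.Random(seed)
    paths = []
    for _ in range(n_paths):
        r = r0
        path = [r]
        for theta_t in theta:
            dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment over dt
            r += (theta_t - a * r) * dt + sigma * dw
            path.append(r)
        paths.append(path)
    return paths


def summarize(paths, z=1.96):
    """Per time step: (mean, variance, lower bound, upper bound)."""
    rows = []
    for values in zip(*paths):
        mean = statistics.fmean(values)
        var = statistics.variance(values)
        half_width = z * math.sqrt(var)  # assumed bound definition
        rows.append((mean, var, mean - half_width, mean + half_width))
    return rows


# Hypothetical inputs: a flat drift series over 5 years of monthly steps.
a, sigma, r0, dt = 0.1, 0.01, 0.03, 1.0 / 12.0
theta = [a * 0.04] * 60  # constant drift pulling the rate toward 4%
paths = simulate_hull_white(theta, r0, a, sigma, dt, n_paths=500)
summary = summarize(paths)
print(summary[-1])
```

In practice the drift series would come from a calibration against observed market rates rather than a constant, and the number of paths trades runtime against the stability of the reported mean and variance.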

PAL also provides a new Benford’s Law function, an analysis commonly used to detect anomalies in numerical datasets such as financial transactions.

One of the (not so) well-known statistical observations is the fact that in many datasets the leading significant digits are not equally distributed. If all digits were represented equally, each would appear 11.1 percent (1/9) of the time. However, when analyzing real-world datasets, e.g. the population totals of the US census data, it turns out that the distribution of the leading digits follows Benford’s law, also known as the first-digit law.

P(d) = log10(1 + 1/d), where P(d) is the probability that the leading digit is d, for d in {1, 2, ..., 9}.

With the help of PAL’s new BENFORD analysis function, it is now very easy to validate whether a dataset obeys Benford’s law or not. This is a commonly used first step in financial applications to detect unexpected value distributions and, for example, potentially fraudulent transaction data.
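
The first-digit check itself is simple enough to sketch in plain Python. The example computes the expected probabilities from P(d) = log10(1 + 1/d), counts the observed leading digits, and compares the two with a Pearson chi-square statistic. It does not call or reproduce PAL's BENFORD function; the helper names, the chi-square comparison and the two sample datasets are assumptions chosen for illustration.

```python
import math
from collections import Counter
from decimal import Decimal


def benford_expected():
    """Expected probability of each leading digit d: log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(value):
    """First significant digit of a non-zero number."""
    digits = Decimal(repr(value)).as_tuple().digits
    if digits[0] == 0:
        raise ValueError("zero has no leading significant digit")
    return digits[0]


def benford_report(values):
    """Observed leading-digit frequencies and chi-square distance to Benford."""
    counts = Counter(leading_digit(v) for v in values if v != 0)
    n = sum(counts.values())
    expected = benford_expected()
    observed = {d: counts[d] / n for d in range(1, 10)}
    chi2 = sum(
        (counts[d] - n * expected[d]) ** 2 / (n * expected[d])
        for d in range(1, 10)
    )
    return observed, chi2


# Hypothetical samples: powers of two are known to follow Benford's law
# closely, while the integers 1..999 have uniformly distributed leading digits.
powers_of_two = [2 ** k for k in range(100)]
uniform_like = list(range(1, 1000))

obs_pow, chi2_pow = benford_report(powers_of_two)
obs_uni, chi2_uni = benford_report(uniform_like)
print(chi2_pow, chi2_uni)
```

A small chi-square value indicates a distribution close to Benford's law, while a large value, as for the uniformly distributed sample, flags a dataset whose leading digits deviate and may deserve a closer look.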

Python ML client (hana-ml) enhancements

The full list of new methods and enhancements in hana-ml 2.20 is summarized in the changelog for hana-ml 2.20.240319, which is part of the documentation. The key enhancements in this release include:

Time series analysis and forecasting methods

Time series permutation feature importance analysis
Time series outlier detection with voting
Segmented (massive) online Bayesian Change Point Detection

Auto-ML configuration and methods enhancements

Updated Auto-ML configuration dictionary templates with new operators and random search optimization support, e.g. for small time series configurations
Enhanced Auto-ML configuration option for setting connection constraints during optimization of multi-operator pipelines and visualization of pipeline connection scores between operators
Support for algorithm-specific parameters in Auto-ML predict calls, relevant for both pipeline predict and Auto-ML methods
Enhanced progress monitor for Auto-ML that can be displayed at any time, and log management methods for setting log levels, persisting progress logs, cleaning up logs and more

Exploratory data analysis and visualization enhancements

New Bubble Plot and Parallel Co-ordinate Plot

You can find an example notebook illustrating the highlighted feature enhancements here: 24QRC01_2.20.ipynb.


