Community challenge alert! The SAP HANA Cloud Mach...

susenpoppe · ‎11-17-2022

Prepare yourself (Prerequisites blogpost) and watch our kick-off call or read our kick-off blogpost.

During the challenge you will get the opportunity working hands-on with SAP HANA Cloud’s machine learning capabilities 🤖🎯.

Besides our open office hours, this thread is the place to ask questions and share your experiences with other machine learning enthusiasts. The challenge team is happy to answer any question you may have in the comments below.

Will you accept the challenge?

ChristophMorgen · ‎11-28-2022

Hi,

You’ve signed up for ML Community challenge "I quit! - How to prevent employee churn” kick-off on November 28th, that’s fantastic!

Review this blog for preparation details: Prerequisites for the SAP HANA Cloud Machine Learning Challenge “I quit!” – How to prevent employee .... As you will need an SAP HANA Cloud instance (Free tier or trial are not compatible with the challenge contents), and if you don’t have an adequate set-up yourselves, please reach out to the team SAPHANACloud@sap.com to get access to a suitable instance.

Looking forward to see you joining our challenge!

Best regards,

Christoph

ChristophMorgen · ‎11-29-2022

Join our open office hour calls, starting today

During the challenge you can join our experts to get your open questions answered and address any issues you are dealing with regarding the challenge. Please find the slots offered below:

November 29, 2022, 9AM CET (5PM KST) | Register here
December 1, 2022, 5PM CET (11AM EST) | Register here
December 6, 2022, 9AM CET (5PM KST) | Register here
December 8, 2022, 5PM CET (11AM EST) | Register here
December 13, 2022, 9AM CET (5PM KST) | Register here
December 15, 2022, 5PM CET (11AM EST) | Register here

Or simply ask you questions here in the forum!

WenjingLiu · ‎11-30-2022

Hi,
I would need to check the data uploaded onto the HANA Cloud database. How can I access SAP HANA Cloud Cockpit or SQL console (SAP HANA Data Explorer) for the following instance? Thanks!

8c7eed2e-f460-4a82-abfa-e3ede36923d7.hna1.prod-eu10.hanacloud.ondemand.com

ChristophMorgen · ‎11-30-2022

Hi, we haven't set up full BTP applications like HANA Cloud Database Explorer available on our side for the participants. However, if you have a local HANA Cockpit / Database Explorer, you can simply establish a SQL Connection to the HANA Cloud instance from there. Alternatively, you can use a tools like DBeaver locally installed and connect (see blog). Certainly from the python side, once the data is uploaded, and you have the dataframe created, you can use dataframe methods to access and check the data, e.g. mydataframe.head(10).collect() would issue a Select top 10 * from <table>. you can see some method examples in Yannick's blog at "How can we explore our data and react to data quality issues early".

Christoph

Vitaliy-R · ‎11-30-2022

If you have a BTP account, e.g. in BTP Trial or a Free Tier, then could you try if generic DBX opens for you at https://hana-cockpit.cfapps.us10.hana.ondemand.com/hrtt/sap/hana/cst/catalog/index.html?

Alternatively, you can use DBeaver -- an open-source IDE for databases: https://blogs.sap.com/2020/04/20/secure-connection-from-dbeaver-to-sap-hana-cloud/

Regards,
-Witalij

ChristophMorgen · ‎12-01-2022

As another helper to get everyone started, find another employee churn data upload and exploration using HANA dataframes notebook in our challenge repository

Sergiu · ‎12-01-2022

Hi.
In pandas dataframe data can be modified with inplace = True or by assigning results to the same dataframe, example df_data = df_data.replace(to_replace = np.nan, value = 'MSV').
In hana dataframe COLLECT allows to show data, but the inplace = True is no supported and assigning modified data back to same dataframe is not allowed, example data_hdf = data_hdf.fillna('MSV').collect() is not supported.
Does that mean that data has to be modified in pandas and updated to hana wtih dataframe.create_dataframe_from_pandas?
COLLECT shows data. What are the methods that allow to modify hana dataframe? Please recommend notebooks with examples on how to modify hana dataframe.
Thanks. Sergiu.

Vitaliy-R · ‎12-01-2022

HANA dataframe does not store data, as it is with the Pandas or R dataframe. I had it explained for example here: https://blogs.sap.com/2020/11/10/multi-model-in-hana_ml-2.6-for-python-dataframes/

HANA dataframe in Python represents a SELECT statement to be executed in SAP HANA database and it is executed only once a method like `collect()` is called.

There is a method `save()` that creates a table or a db view for data returned by a HANA dataframe to the database :

https://help.sap.com/doc/1d0ebfe5e8dd44d09606814d83308d4b/2.0.06/en-US/hana_ml.dataframe.html#hana_m...

Sergiu · ‎12-01-2022

Pandas dataframe:
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]})
df['C] = 1
df['C'] = df['C] + 2

hana-ml Tutorial - Dataframe:
d_train_new = d_train.add_constant('C',1)

I couldn't find in tutorial how to modify a column, tried as in pandas but I got an error.
d_train_new['C'] = d_train_new['C'] + 1
TypeError: __getitem__ argument not understood: 'C'

Please help with examples on how to modify hana dataframe column.

ChristophMorgen · ‎12-01-2022

you can simply use the select dataframe method to create new columns

df.select('*', ('B * 4', 'D')).collect()

Sergiu · ‎12-01-2022

In presented notebooks in hana model.fit the parameter key is mandatory.
In Python machine libraries key parameter is not necessary.
What is the role of the key hana libraries?
MODEL_SEARCH.fit(data=d_train, key= ID,
label=target,
partition_method='stratified',
partition_random_state=1,
stratified_column=target,
build_report=False)

Thanks. Sergiu.

ChristophMorgen · ‎12-01-2022

Hi Sergiu, key is actually not mandatory with fitting a model. However if the input dataframe to a fit-call contains and Id column, you either have to drop it from the dataframe beforehand, or identify the column as key / id column so it's not treated as a feature column. With the predict / score methods e,g. of unified classification, an id-column is required.

Hope this explains.

Christoph

Sergiu · ‎12-02-2022

Estimate car price - Load data to SAP HANA

Pandas:
# df - pandas dataframe
df_data.insert(0, 'CAR_ID', df_data.reset_index().index)
OR
df_data[COL_CAR_ID] = df_data.index

Please help with examples on how to copy in hana_ml.dataframe index to column.

Thanks. Sergiu.

Sergiu · ‎12-04-2022

# df - hana_ml.dataframe
df.add_id(id_col='ID')

ChristophMorgen · ‎12-05-2022

Great you found the solution in between!!! 😊

YannickSchaper · ‎12-05-2022

Today we share additional data, which was discovered from our data engineer. The additional data for example includes if an employee recently updated their LinkedIn Profile or took part in an HR training. During the open office calls this week on Tuesday/Thursday (see this post for timings), we will also demo how to use Unified Classification with PAL Hybrid Gradient Boosting (see example) and the new PAL AutoML-Classification feature (see example, blog, community call introduction, more automl examples), as two additional approaches to tackle the employ churn challenge and achieve best results …

Have fun working on the challenge!

susenpoppe · ‎12-06-2022

Join us in 5 minutes for a short demo and to address your open questions in our open office hour.

Sergiu · ‎12-06-2022

susenpoppe · ‎12-06-2022

Hi Sergiu, Please try this link: https://sap-se.zoom.us/j/96194777489?pwd=NmVIRHlFcUEvT3RtS3VhdTgySU5pZz09

ChristophMorgen · ‎12-08-2022

Today we have our second open office session (see this post for timings), have you already include the additonal data to improve your model? Maybe HRTRAINING, SICKDAYS, ... adds additional insights, you should try it out!

Sergiu · ‎12-05-2022

hana_ml.dataframe - replace allows to replace values in the same column from a dictionary.

Examples
>>> df.collect()
  Aa Bb Cb
0 0  5  a
1 10 0  b
2 2  7  c
3 3  8  d
4 4  9  e

>>> df.replace(to_replace={'Aa': {0: 100, 4: 400}}).collect()
  Aa  Bb Cb
0 100 5  a
1 10  0  b
2 2   7  c
3 3   8  d
4 400 9  e

Pandas
>>> df = pd.DataFrame([10,20,30,40,50], columns =['speed'])
>>> df.loc[(df["speed"] <= 20), "gear"] = 'd1'
>>> df.loc[(df["speed"] > 20) & (df["speed"] <= 40), "gear"] = 'd2'
>>> df.loc[(df["speed"] > 40), "gear"] = 'd3'
>>> df

  speed gear
0 10    d1
1 20    d1
2 30    d2
3 40    d2
4 50    d3

Please help with examples for ml_hana.dataframe on how to create an additional column with values based on conditions of other columns as in pandas.

Thanks. Sergiu.

Sergiu · ‎12-06-2022

hana_ml.dataframe.DataFrame - replace allows to replace values from same column from a dictionary.

Examples
--------
>>> df.collect()
    Aa  Bb  Cb
    0   0   5   a
    1   10  0   b
    2   2   7   c
    3   3   8   d
    4   4   9   e

>>> df.replace(to_replace={'Aa': {0: 100, 4: 400}}).collect()
    Aa  Bb  Cb
    0   100 5   a
    1   10  0   b
    2   2   7   c
    3   3   8   d
    4   400 9   e

Pandas
>>>df = pd.DataFrame([10,20,30,40,50], columns =['speed'])
>>>df.loc[(df["speed"] <= 20), "gear"] = 'd1'
>>>df.loc[(df["speed"] > 20) & (df["speed"] <= 40), "gear"] = 'd2'
>>>df.loc[(df["speed"] > 40), "gear"] = 'd3'
>>>df
  speed gear
0 10    d1
1 20    d1
2 30    d2
3 40    d2
4 50    d3

Please provide examples for hana_ml.dataframe.DataFrame on how to create conditional column based on values of another column.

Thanks. Sergiu.

Sergiu · ‎12-06-2022

df = pd.DataFrame([10,20,30,40,50], columns =['SPEED'])
df.loc[(df["SPEED"] <= 20), "GEAR"] = 'd1'
df.loc[(df["SPEED"] > 20) & (df["SPEED"] <= 40), "GEAR"] = 'd2'
df.loc[(df["SPEED"] > 40), "GEAR"] = 'd3'
df
  SPEED GEAR
0 10    d1
1 20    d1
2 30    d2
3 40    d2
4 50    d3

hana_tab = "T_GEAR"
df_remote = dataframe.create_dataframe_from_pandas(connection_context = conn, pandas_df = df, 
                                                           table_name = hana_tab,
                                                                force = True,
                                                              replace = False)

df_remote_new = df_remote.select (['SPEED','GEAR',("""CASE WHEN SPEED <=20 THEN 'g1'
                                                           WHEN SPEED >20 AND SPEED <=40 THEN 'g2'
                                                           WHEN SPEED >40 THEN 'g3'
                                                      END""",'GEAR_HANA')])

df_remote_new.collect()

  SPEED GEAR GEAR_HANA
0 10    d1   g1
1 20    d1   g1
2 30    d2   g2
3 40    d2   g2
4 50    d3   g3

ChristophMorgen · ‎12-06-2022

Great to see pivoting from pandas dataframe code towards the HANA dataframe approach! 🙂 Excellent!

AndreasForster · ‎12-06-2022

In one of the open office hours a question came up on creating new features out of the existing columns, without data extraction.

Here is an example, that bundles Finland, Sweden and Norway for example into Scandinavia in a new column

df_remote = conn.table('EMPLOYEE_CHURN_ORG')
df_rem_prepped = df_remote.select(['EMPLOYEE_ID', 'AGE', \
("""CASE WHEN CURRENT_COUNTRY IN ('Sweden', 'Norway', 'Finland') THEN 'Scandinavia'
WHEN CURRENT_COUNTRY = 'UK' THEN 'United Kingdom'
WHEN CURRENT_COUNTRY IS NULL THEN 'Unknown'
ELSE CURRENT_COUNTRY END""", 'REGION_LOCAL')])
df_rem_prepped.select('REGION_LOCAL').distinct().collect()

Sergiu · ‎12-06-2022

How to include in result columns from list ls_col =['EMPLOYEE_ID', 'AGE'] instead of writing directly in select?

ChristophMorgen · ‎12-06-2022

You can make use of Python function strings like str.format() or f-strings (python background here) to apply parameters with hana-ml / dataframe methods.

See a hana-ml example here (not exact match to ls_col)

id='####'
hdf_new=hdf_remote.filter('"uuid" = \'{}\''.format(id)).sort('date', desc=False)

Sergiu · ‎12-06-2022

ls_col =['SPEED', 'GEAR']
#remove [ and ] from statement
df_remote_new = df_remote.select (LS_COL,("""CASE WHEN SPEED <=20 THEN 'g1'
                                                  WHEN SPEED >20 AND SPEED <=40 THEN 'g2'
                                                  WHEN SPEED >40 THEN 'g3'
                                             END""",'GEAR_HANA'))

# include all fields
df_remote_new = df_remote.select ('*',("""CASE WHEN SPEED <=20 THEN 'g1'
                                               WHEN SPEED >20 AND SPEED <=40 THEN 'g2'
                                               WHEN SPEED >40 THEN 'g3'
                                          END""",'GEAR_HANA'))

Sergiu · ‎12-07-2022

HANA Cloud - Predicitive Analysis Library Hands On

from hana_ml.visualizers.dataset_report import DatasetReportBuilder
datasetReportBuilder = DatasetReportBuilder()
datasetReportBuilder.build(df_remote, key="PRODUCT_ID")

ERROR:
File ~\Anaconda3\envs\cln_ml_hana\lib\site-packages\hana_ml\visualizers\dataset_report.py:480, in DataAnalyzer.get_numeric_variable_distribution_data(self, column, bins)
    478 if bins > 1:
    479     bins = bins - 1
--> 480 _, _, bin_data = distribution_plot(data=data_,
    481                                    column=column,
    482                                    bins=bins)
    483 x_data = list(bin_data['BANDING'])
    484 y_data = list(bin_data['COUNT'])

File ~\Anaconda3\envs\cln_ml_hana\lib\site-packages\hana_ml\visualizers\eda_plotly.py:138, in distribution_plot(data, column, bins, title, x_axis_label, y_axis_label, x_axis_fontsize, x_axis_rotation, debrief, rounding_precision, replacena, fig, subplot_pos, **kwargs)
    136             bin_data = bin_data.append({"BANDING": item, "COUNT": 0}, ignore_index=True)
    137 bin_data.sort_values(by="BANDING", inplace=True)
--> 138 trace = go.Bar(x=bin_data['BANDING'], y=bin_data['COUNT'])
    139 if fig:
    140     fig.add_trace(trace, row=subplot_pos[0], col=subplot_pos[1])

NameError: name 'go' is not defined

Please help with a solution for the above error.

Thanks. Sergiu.

fazam · ‎12-07-2022

My suggestion would be to make sure plotly is working properly. Try adding the following code line

import plotly.graph_objs as go

Also, check to make sure that ipywidgets is installed.

ChristophMorgen · ‎12-08-2022

Hi Sergiu, we've just released a new hana-ml fix version on pypi to address the plotly dependency issue observed.

Christoph

Sergiu · ‎12-08-2022

#| top_k_attributions : int, optional
#| Specifies the number of features with highest attributions to output
>>> result = uc_rdt.predict(df_predict, key = "ID", top_k_attributions=10)

Please help to interpret when top_k_attributions = 1 OR top_k_attributions = 10 calculates same CONFIDENCE and outputs same number of features in shapley_explainer.

ChristophMorgen · ‎12-08-2022

The confidence is the overall predicted classification decision confidence (over all features). As we may have a large number of input features (e.g. >100), we want to restrict the output the feature values and their relative contribution to the local predicted classification decision to the most important ones, default the top 10. if you top_k_attributions=1 you would only output the single top most important feature and its relative importance.

Example Predict with reason_code and top_k_attribution for the PredictiveQuality Tutorial example

# Explore test data-subset predictions applying the trained model
features = hdf_test.columns
features.remove('QUALITY')
features.remove('PRODUCT_ID')
# Using the predict-method with our model object hgbc
pred_res = hgbc.predict(hdf_test.head(1000), key='PRODUCT_ID', features=features, impute=True, top_k_attributions=3 )
# Review the predicted results
pred_res.select('PRODUCT_ID', 'SCORE', 'CONFIDENCE', 'REASON_CODE' ).head(3).collect()

Sergiu · ‎12-08-2022

I suppose Confusion Matrix from UnifiedReport(model).build().display() is build from d_train (seen data) and only for part of d_train. Confusion Matrix total = 20% * d_train.count(). How to include all d_train or increase the %?

ChristophMorgen · ‎12-08-2022

Referring to our tutorial code example about predictive quality

print('Size of val subset: ' + str(hdf_val.count()))

Size of val subset: 12500

If I now print the confusion matrix after the fit

display(hgbc.confusion_matrix_.collect())

ACTUAL_CLASS | PREDICTED_CLASS| COUNT

0	0	9298
0	1	704
1	0	887
1	1	1611

That's exactly the 12500 we passed in to the fit as the validation subset, along with the 100k rows passed in to train the model.

Sergiu · ‎12-10-2022

PAL Tutorial - Unified Classification Hybrid Gradient Boosting - PredictiveQuality Example.ipynb

I couldn't find fit.hgbc cell with with validation set hdf_val.

Please update the notebook.

ChristophMorgen · ‎12-12-2022

If you download the notebook, in the hgbc.fit call, we input the union of hdf_val and hdf_train and with partition_method specify we've predefined the val-partition used during fit-validation. The fit method in unified classification, validates on pre-defined validation subset or can use part of the training-hdf.

hgbc.fit(data=df_trainval, 

         key='PRODUCT_ID', 

         ...

        partition_method='user_defined', purpose='TRAIN_VAL_INDICATOR')

Sergiu · ‎12-07-2022

Please help with examples regarding how to extract features contribution from UnifiedClassification('HybridGradientBoostingTree') .

In other models set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'Individual Contributions'}) can be used.

Thanks. Sergiu.

ChristophMorgen · ‎12-07-2022

The UnifiedClassification predict method, supports attribution_method (Saabas or Tree-SHAP) and top_k_attributions to output with the predict-call. Here you can find an example

# Using the predict-method with our model object hgbc
pred_res = hgbc.predict(df_test.head(1000), key='uuid', features=features, impute=True )

# Review the predicted results
pred_res.select('uuid', 'SCORE', 'CONFIDENCE', 'REASON_CODE', 
                ('json_query("REASON_CODE", \'$[0].attr\')', 'Top1'), 
                ('json_query("REASON_CODE", \'$[0].pct\')', 'PCT_1') ).head(3).collect()

Community challenge alert! The SAP HANA Cloud Machine Learning Challenge is about to start!