marc_daniau
Advisor
In a previous article, we saw how to train and save a classification model from a Jupyter notebook using the Python API of SAP Predictive Analytics. For the user, the next logical step in predictive modeling is to look at the model's performance indicators, visualize the ROC curve, discover which predictors contribute the most, check the correlated variables, and analyze the binned variables. This post shows how to achieve all of that with the Python API.

All the following snippets work the same whether your model was built from a notebook or from the SAP Predictive Analytics desktop application.

 

Initialization

We open a brand new Jupyter notebook and run the code below before working with our model. You may need to adjust the three paths provided at the beginning so that they match your own installation directories.
# Make the aalib bindings and their native libraries reachable.
import sys
sys.path.append(r"C:\Program Files\SAP Predictive Analytics\Desktop\Automated\EXE\Clients\Python35")
import os
os.environ['PATH'] = r"C:\Program Files\SAP Predictive Analytics\Desktop\Automated\EXE\Clients\CPP"

AA_DIRECTORY = r"C:\Program Files\SAP Predictive Analytics\Desktop\Automated"

import aalib

class DefaultContext(aalib.IKxenContext):
    def __init__(self):
        super().__init__()

    def userMessage(self, iSource, iMessage, iLevel):
        print(iMessage)
        return True

    def userConfirm(self, iSource, iPrompt):
        pass

    def userAskOne(self, iSource, iPrompt, iHidden):
        pass

    def stopCallBack(self, iSource):
        pass

frontend = aalib.KxFrontEnd([])
factory = frontend.getFactory()
context = DefaultContext()

# Point the configuration store at the Automated Analytics client files.
factory.setConfiguration("DefaultMessages", "true")
config_store = factory.createStore("Kxen.FileStore")
config_store.setContext(context, 'en', 10, False)
config_store.openStore(AA_DIRECTORY + r"\EXE\Clients\CPP", "", "")
config_store.loadAdditionnalConfig("KxShell.cfg")

 

Loading the Predictive Model

Let’s retrieve the model we have saved previously.

We specify the location of the model, its name, and we open a store.
folder_name = r"O:\MODULES_PA/PYTHON_API/MY_MODELS"
model_name = "My Classification Model"
store = factory.createStore("Kxen.FileStore")
store.openStore(folder_name, "", "")

The method described below allows us to get the latest version of the model.
help(store.restoreLastModelD)



We load the model.
model = store.restoreLastModelD(model_name)

 

A First Look at the Model

We display the general properties of the predictive model.
# Retrieve the root of the model parameter tree.
model.getParameter("")

my_list = ["ModelName", "BuildDate", "BuildData", "ClassName/Default"]
for i in my_list:
    d_object = model.getParameter("Infos/" + i)
    d_value = d_object.getNameValue().value
    print("{}: {}".format(i, d_value))



We are not sure what the name of the target for this model is. Let’s find out.

Note that a model can have several targets. In the present case, our model has only one target. In the following code, the value of the target name is assigned to the Python variable: target_col.
d_path = "Protocols/Default/Transforms/Kxen.RobustRegression/Results"
d_param = model.getParameter(d_path)
d_values = d_param.getSubEntries([""])
target_idx = 0
target_col = d_values[target_idx][0]
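
To double-check which column was detected, we can simply print the name(s) returned; this is just a quick sanity check reusing the d_values list populated above.
# Print the target name(s) found under the Results node; our model has a single target.
for idx, entry in enumerate(d_values):
    print("Target {}: {}".format(idx, entry[0]))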

Because our target is made of two classes, we can show the proportion of positive and negative cases.
partition = 'Estimation'
d_path = "Protocols/Default/Variables/%s/Statistics/%s/Categories" % (target_col, partition)
d_param = model.getParameter(d_path)
d_values = d_param.getSubEntries(["", "Weight", "Frequency"])
v_sorted = sorted(d_values, key=lambda x: float(x[2]), reverse=True)

text= "Observations within the %s partition" % (partition)
print('\x1b[1m'+ text + '\x1b[0m')
xs0 = [x[0] for x in v_sorted]
xs2 = [int(x[2]) for x in v_sorted]
xs3 = [float(x[3]) * 100 for x in v_sorted]
for x0, x2, x3 in zip(xs0, xs2, xs3):
print("{} = {} ; {:0} rows ({:0.2f}%)".format(target_col, x0, x2, x3))



We check how the data was split for learning and validation purposes.
d_path = ("Protocols/Default/Variables/%s/Statistics") % target_col
d_param = model.getParameter(d_path)
d_values = d_param.getSubEntries(["", "WeightTotal"])

import pandas as pd
df = pd.DataFrame(list(d_values), columns=["Partition", "Type", "Observations"])

df.drop("Type", axis=1, inplace=True)
df['Observations'] = df['Observations'].astype(int)
df['In %'] = (df['Observations'] / df['Observations'].sum()).round(4)*100
df
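
If a quick picture helps, the same DataFrame can be plotted directly with pandas. This is a minimal sketch (matplotlib is imported here; it is needed again later for the contribution chart):
import matplotlib.pyplot as plt

# Bar chart of the estimation/validation split, reusing the DataFrame built above.
df.plot(kind='barh', x='Partition', y='Observations', legend=False,
        title='Observations per partition', fontsize=12)
plt.show()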



One essential piece of information about a trained model is its performance indicators.
indicator = 'Ki'
d_path = "Protocols/Default/Variables/rr_%s/Statistics/Validation/Targets/%s/%s" % (target_col, target_col, indicator)
d_object = model.getParameter(d_path)
d_value = float(d_object.getNameValue().value) * 100
print("Predictive Power (KI) is {:0.2f}%".format(d_value))
indicator = 'Kr'
d_path = "Protocols/Default/Variables/rr_%s/Statistics/Estimation/Targets/%s/%s" % (target_col, target_col, indicator)
d_object = model.getParameter(d_path)
d_value = float(d_object.getNameValue().value) * 100
print("Prediction Confidence (KR) is {:0.2f}%".format(d_value))



The predictive power (accuracy) of our model is 80.97%. This indicator gives the percentage of information in the target that can be explained by the predictors. The prediction confidence is 99.62%. This second indicator evaluates the capacity of the model to reproduce the same accuracy when applied to a new dataset exhibiting the same characteristics as the training dataset. A model with a prediction confidence value of 95% or above is considered robust.
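
Since the two reads above differ only by the indicator name and the partition, they can be wrapped in a small helper. This is merely a convenience sketch reusing the parameter paths shown above; it is not part of the aalib API.
def get_indicator(model, target_col, indicator, partition):
    # Read one performance indicator (e.g. 'Ki' or 'Kr') from the model parameter tree.
    d_path = "Protocols/Default/Variables/rr_%s/Statistics/%s/Targets/%s/%s" % (
        target_col, partition, target_col, indicator)
    return float(model.getParameter(d_path).getNameValue().value) * 100

print("Predictive Power (KI) is {:0.2f}%".format(get_indicator(model, target_col, 'Ki', 'Validation')))
print("Prediction Confidence (KR) is {:0.2f}%".format(get_indicator(model, target_col, 'Kr', 'Estimation')))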

 

Contribution of Predictors

After reviewing the overall model performance, the user often wants to know how much each predictor contributes to the target outcome. The contributions can be presented in a table.
d_path = ("Protocols/Default/Transforms/Kxen.RobustRegression/Results/%s/MaxCoefficients" % target_col)
d_param = model.getParameter(d_path)
d_values = d_param.getSubEntries(("Contrib"))
v_sorted = sorted(d_values, key=lambda x: float(x[1]), reverse=True)

df = pd.DataFrame(v_sorted, columns=["Variable", "Contribution"])
df['Contribution'] = df['Contribution'].astype(float)
df['Cumulative'] = df['Contribution'].cumsum()
df['Contribution'] = df['Contribution'].round(4)*100
df['Cumulative'] = df['Cumulative'].round(4)*100
non_zero = df['Contribution'] != 0
dfs = df[non_zero].sort_values(by=['Contribution'], ascending=False)
dfs



Seven variables out of 14 were kept by the automatic variable-selection process.
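
If you also want the list of discarded variables, the zero-contribution rows of the same DataFrame give it directly; a minimal sketch:
# Variables excluded by the automatic selection (zero contribution).
excluded = df.loc[~non_zero, 'Variable'].tolist()
print("{} variables excluded: {}".format(len(excluded), ", ".join(excluded)))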

We plot the contributions in a bar chart.
import matplotlib.pyplot as plt
c_title = "Contributions to %s" % target_col
dfs = dfs.sort_values(by=['Contribution'], ascending=True)
dfs.plot(kind='barh', x='Variable', y='Contribution', title=c_title, legend=False, fontsize=12)
plt.show()



 

Performance Curve

We draw a ROC curve to visualize the performance of the classifier.
def plot_profit_curve(model,
                      variable_name,
                      curve_type,
                      target_variable="",
                      target_key_variable="",
                      num_points=20,
                      data_set="",
                      based_on_frequency=True,
                      use_weight=True,
                      use_groups=True):

    proto = model.getProtocolFromName("Default")
    curve = proto.getProfitCurve2(variable_name, target_variable,
                                  target_key_variable, num_points,
                                  curve_type, data_set, based_on_frequency,
                                  use_weight, use_groups)
    # The first row holds the column names, the remaining rows the curve points.
    pd_curve = pd.DataFrame(list(curve[1:]), columns=curve[0])

    fig = plt.figure(figsize=(14, 8))
    fig.gca().set_facecolor('none')
    fig.gca().spines['left'].set_color('grey')
    fig.gca().spines['bottom'].set_color('grey')
    fig.gca().spines['top'].set_color('none')
    fig.gca().spines['right'].set_color('none')

    mygreen = '#00ff00'
    plt.fill_between(pd_curve['Frequency'], pd_curve['Random'], pd_curve['Wizard'], color=mygreen, alpha=0.1)
    plt.fill_between(pd_curve['Frequency'], pd_curve['Random'], pd_curve['Validation'], color="blue", alpha=0.1)

    pl_rnd, = plt.plot(pd_curve['Frequency'], pd_curve['Random'], color='red', label='Random', lw=1)
    pl_wiz, = plt.plot(pd_curve['Frequency'], pd_curve['Wizard'], color=mygreen, label='Wizard', lw=1)
    pl_est, = plt.plot(pd_curve['Frequency'], pd_curve['Validation'], color='blue', label='Model', lw=1)

    plt.grid(color='b', alpha=0.9, linestyle='dashed', linewidth=0.2)
    plt.xlabel('False Positive Rate', fontsize=14)
    plt.ylabel('True Positive Rate', fontsize=14)
    # Remove the leading 'rr_' prefix from the variable name for the chart title.
    plt.title('ROC Curve for ' + variable_name.replace('rr_', '', 1), fontsize=16, fontweight='bold')
    plt.xticks(pd_curve['Frequency'], pd_curve['Frequency'], rotation=30, fontsize=12)
    plt.legend(handles=[pl_rnd, pl_wiz, pl_est], loc=4, fontsize=16)
    plt.show()

plot_profit_curve(model, variable_name="rr_" + target_col, curve_type=aalib.Kxen_roc)



The green curve represents the hypothetical perfect model, the blue curve represents our model, and the red line represents a random guess.
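
If you would like a single number to accompany the chart, you can retrieve the same curve points and approximate the area under the model curve with the trapezoidal rule. This is only a sketch: it reuses the getProfitCurve2 call from plot_profit_curve above and assumes, as the axis labels do, that the Frequency and Validation columns hold the false and true positive rates.
import numpy as np

def get_roc_auc(model, variable_name, num_points=20):
    # Same curve retrieval as in plot_profit_curve, without the plotting.
    proto = model.getProtocolFromName("Default")
    curve = proto.getProfitCurve2(variable_name, "", "", num_points,
                                  aalib.Kxen_roc, "", True, True, True)
    pd_curve = pd.DataFrame(list(curve[1:]), columns=curve[0])
    # Trapezoidal approximation of the area under the model (Validation) curve.
    return np.trapz(pd_curve['Validation'].astype(float),
                    pd_curve['Frequency'].astype(float))

print("Approximate AUC: {:0.3f}".format(get_roc_auc(model, "rr_" + target_col)))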

More information about the ROC curve can be found here.

 

Correlation between Predictors

At training time, Automated Analytics evaluates the correlation strength between candidate predictors. Only pairs whose correlation coefficient exceeds the model threshold are reported. By default, this threshold is set to 0.5 in absolute value.
# Read Threshold Value
d_path = ("Protocols/Default/Transforms/Kxen.RobustRegression/Parameters/LowerBound")
d_param = model.getParameter(d_path)
d_value = float(d_param.getNameValue().value)
# Get Pairs with Coefficient
d_path = ("Protocols/Default/Transforms/Kxen.RobustRegression/Results/%s/Correlations" % target_col)
d_param = model.getParameter(d_path)
d_values = d_param.getSubEntries(["", "Var1", "Var2"])
v_sorted = sorted(d_values, key=lambda x: float(x[1]), reverse=True)
# Print Pairs
text = "Correlations above %s" % (d_value)
print('\x1b[1m' + text + '\x1b[0m')
xs1 = [round(float(x[1]), 3) for x in v_sorted]
xs2 = [x[2] for x in v_sorted]
xs3 = [x[3] for x in v_sorted]
for x1, x2, x3 in zip(xs1, xs2, xs3):
    print("{:>40} : {:0.3}".format(x2 + ' - ' + x3, x1))



 

Statistics on Binned Variables

Automated Analytics takes care of various data pre-processing operations such as encoding and binning. The trained model stores summary statistics computed for each predictor by bin (also called category). One can list all the categories of a given predictor together with their statistics.

We choose a continuous variable: capital gain.
partition = 'Estimation'
predictor = 'capital-gain'

d_path = ("Protocols/Default/Variables/%s/Statistics/%s/Categories") % (predictor, partition)
d_param = model.getParameter(d_path)
d_path = ("Targets/%s/NormalProfit") % target_col
d_values = d_param.getSubEntries(["", "Weight", "Frequency", d_path])

df = pd.DataFrame(list(d_values), columns=["Category", "Type", "Observations", "In %", "Normal Profit"])
df.drop("Type", axis=1, inplace=True)
df['Observations'] = df['Observations'].astype(int)
df['In %'] = df['In %'].astype(float).round(4)*100
df['Normal Profit'] = df['Normal Profit'].astype(float).round(3)
df



After generating categories, Automated Analytics builds groups, which are in effect compressed categories. These groups appear in the final scoring equation. The table we just built for categories can also be built for groups.
d_path = ("Protocols/Default/Variables/%s/Statistics/%s/Targets/%s/Groups") % (predictor, partition, target_col)
d_param = model.getParameter(d_path)
d_path = ("Targets/%s/NormalProfit") % target_col
d_values = d_param.getSubEntries(["", "Weight", "Frequency", d_path])

df = pd.DataFrame(list(d_values), columns=["Group", "Type", "Observations", "In %", "Normal Profit"])
df.drop("Type", axis=1, inplace=True)
df['Observations'] = df['Observations'].astype(int)
df['In %'] = df['In %'].astype(float).round(4)*100
df['Normal Profit'] = df['Normal Profit'].astype(float).round(3)
df



 

Binned Variable Distribution for each Class

In this last section, we show how to list categories with their positive and negative cases.

We choose a categorical variable: occupation.
predictor = 'occupation'

d_path = "Protocols/Default/Variables/%s/Statistics/%s/Categories" % (target_col, partition)
d_param = model.getParameter(d_path)
d_values = d_param.getSubEntries(["", "Code"])
target_v1 = d_values[0][0]
target_v2 = d_values[1][0]

d_path = ("Protocols/Default/Variables/%s/Statistics/%s/Categories") % (predictor, partition)
d_param = model.getParameter(d_path)
d_path = ("Targets/%s/TargetCategories/%s/Weight") % (target_col, target_v1)
d_path2 = ("Targets/%s/TargetCategories/%s/Weight") % (target_col, target_v2)
d_values = d_param.getSubEntries(["", d_path, d_path2])

df = pd.DataFrame(list(d_values), columns=["Category", "Type", target_v1, target_v2])
df.drop("Type", axis=1, inplace=True)
df[target_v1] = df[target_v1].astype(int)
df[target_v2] = df[target_v2].astype(int)
df['Total'] = df[target_v1] + df[target_v2]
df[target_v1 + ' (in %)'] = (df[target_v1] / df['Total']).round(4)*100
df[target_v2 + ' (in %)'] = (df[target_v2] / df['Total']).round(4)*100
df



We plot two bar charts to visually compare the distribution.
text= "Observations by Category for %s" % (predictor)
print('\x1b[31;1m'+ text + '\x1b[0m')
c_title = "%s = %s" % (target_col, target_v1)
df.plot(kind='barh', x='Category', y=target_v1, title=c_title,
alpha=0.6, width=1.0, edgecolor='grey', legend=False, fontsize=12)
c_title = "%s = %s" % (target_col, target_v2)
df.plot(kind='barh', x='Category', y=target_v2, title=c_title,
alpha=0.6, width=1.0, edgecolor='grey', legend=False, fontsize=12)
plt.show()



An equivalent analysis can be done by group.
d_path = ("Protocols/Default/Variables/%s/Statistics/%s/Targets/%s/Groups") % (predictor, partition, target_col)
d_param = model.getParameter(d_path)
d_path = ("Targets/%s/TargetCategories/%s/Weight") % (target_col, target_v1)
d_path2 = ("Targets/%s/TargetCategories/%s/Weight") % (target_col, target_v2)
d_values = d_param.getSubEntries(["", d_path, d_path2])

df = pd.DataFrame(list(d_values), columns=["Group", "Type", target_v1, target_v2])
df.drop("Type", axis=1, inplace=True)
df[target_v1] = df[target_v1].astype(int)
df[target_v2] = df[target_v2].astype(int)
df['Total'] = df[target_v1] + df[target_v2]
df[target_v1 + ' (in %)'] = (df[target_v1] / df['Total']).round(4)*100
df[target_v2 + ' (in %)'] = (df[target_v2] / df['Total']).round(4)*100
df



Again, we plot two separate charts for visual comparison.
text= "Observations by Group for %s" % (predictor)
print('\x1b[31;1m'+ text + '\x1b[0m')
c_title = "%s = %s" % (target_col, target_v1)
df.plot(kind='barh', x='Group', y=target_v1, title=c_title,
alpha=0.6, width=1.0, edgecolor='grey', legend=False, fontsize=12)
c_title = "%s = %s" % (target_col, target_v2)
df.plot(kind='barh', x='Group', y=target_v2, title=c_title,
alpha=0.6, width=1.0, edgecolor='grey', legend=False, fontsize=12)
plt.show()



 

In an upcoming post, we will explain how you can export the scoring equation of a trained model.
2 Comments
ikiselev
Very interesting - thank you for sharing!

 

Can we import the model into SAC as well?
marc_daniau
Advisor
Hi Igor,

There is no such feature "import predictive model" in SAC at the moment.

Do you have a particular scenario in mind?

 

Kind Regards