Technology Blogs by SAP
xinchen
Product and Topic Expert

We are excited to introduce Fairness in Machine Learning (FairML) - a new feature of the SAP HANA Predictive Analysis Library (PAL), delivered with 2023 QRC04, that promotes fairness in machine learning.

Fairness is a critical subject within human society. The pursuit of fairness is essential in many social areas involving resource allocation, such as college admission, job candidate selection, and personal credit evaluation. In these areas, it is important to ensure that decision-makers are equipped with a sense of fairness to prevent any undue advantage to a particular group or individual.

As our society progresses, its complexity is also increasing. As a result, many automated decision-making systems, powered by machine learning and AI, have been developed. Initially, these systems served merely as helpers or aiding components, yet there is a clear trend toward them playing more significant roles. For instance, many HR departments have implemented systems for the automated filtering of job applicants in recent years. In the US, software called PredPol (short for "Predictive Policing") estimates potential crime rates across different regions based on historical data. A tool named COMPAS is employed by US courts to assess the probability of a defendant becoming a repeat offender.

However, a major concern with these automated decision-making systems is their lack of fairness awareness. Unlike many human decision-makers, most of these systems are constructed without a conscious ethos of fairness or rely on limited, usually inadequate, fairness-aware mechanisms. Human oversight is therefore a crucial ingredient of responsible AI systems. Moreover, amid the current surge in AI adoption, policymakers and legislators are very active in specifying general guardrails and red lines for applying AI systems in processes that affect humans. These rules can be highly use-case specific and protect very specific groups. Therefore, local legislation must always be considered with respect to the use of specific attributes in a given use case.

For SAP, AI fairness is one of the three pillars of the SAP AI ethics policy and a fundamental principle of AI ethics, which aims to prevent discrimination by AI systems against specific demographic groups or individuals based on protected attributes like gender, ethnicity, or religion.

FairML within the SAP HANA Cloud Predictive Analysis Library (PAL) strives to mitigate the unfairness resulting from potential biases within datasets related to protected classes. The PAL fairness framework offers flexibility by supporting various machine learning models and does not require access to the protected attribute(s) at apply or predict time. Currently, the framework primarily supports the Hybrid Gradient Boosting Tree (HGBT) sub-model, particularly for binary classification and regression scenarios. By incorporating fairness constraints like demographic parity and equalized odds during model training, PAL FairML produces inherently fair models. Furthermore, the machine learning fairness function is designed to be user-friendly and easily accessible. It can be invoked through the standard database SQL interface or the Python Machine Learning client for SAP HANA.

In the following sections, we will use a publicly accessible synthetic recruiting dataset as a case study to explore potential fairness-related harms in classic machine learning models, and then show how to mitigate such unfair bias with PAL's new FairML feature. Initially, we will train an SAP PAL HGBT model as a baseline; subsequently, FairML is employed to train a model aimed at mitigating bias.

Disclaimer and Important Note

The following example is merely a demonstration of the method and should not be considered comprehensive guidance for real-world applications. Users should exercise caution, taking into account the laws and regulations of their respective countries and the appropriateness of the usage scenario, when deciding to use this algorithm.

Data Description

In the realm of fairness in machine learning, public datasets often pose several challenges, such as the absence of protected attributes (which may be restricted in certain jurisdictions) or issues with data quality. Therefore, to illustrate the new method, we use a synthetic recruiting dataset derived from an example by the Centre for Data Ethics and Innovation.

This synthetic dataset describes individual job applicants with attributes covering their experience, qualifications, and demographics. The source also documents the data generation method in depth, which is designed so that both the features and the labels reflect specific unfair biases.

In this dataset, two typical sensitive attributes, 'gender' and 'ethical_group', are used as a basis for generating bias in other features such as years of experience, IT skills, income, and referral status. Variables that are strongly correlated with sensitive data are typically described as proxy attributes. However, because the relationship between sensitive and proxy attributes can be complex, the latter may not always be easily identified and can thus introduce bias into the modeling process.

We have settled on the following 13 variables (Column 2 to Column 14) as ones that might be relevant in an automated recruitment setting. The first column is the ID column and the last one is the target variable 'employed_yes'.

  1. ID: ID column
  2. gender : Female and male, identified as 0 (Female) and 1 (Male)
  3. ethical_group : Two ethical groups, identified as 0 (ethical group X) and 1 (ethical group Y)
  4. years_experience : Number of career years relevant to the job
  5. referred : Did the candidate get referred for this position
  6. gcse : GCSE results
  7. a_level : A-level results
  8. russell_group : Did the candidate attend a Russell Group university
  9. honours : Did the candidate graduate with an honours degree
  10. years_volunteer : Years of volunteering experience
  11. income : Current income
  12. it_skills : Level of IT skills
  13. years_gaps : Years of gaps in the CV
  14. quality_cv : Quality of written CV
  15. employed_yes : Whether currently employed or not (target variable)

A total of 10,000 instances have been generated and the dataset has been divided into two dataframes: df_train (80%) and df_test (20%).
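The loading and splitting steps are not shown in the blog; the following is a minimal sketch, assuming the generated data is available locally as a CSV file (the file name, table names, and connection details below are placeholders).

>>> import pandas as pd
>>> from hana_ml.dataframe import ConnectionContext, create_dataframe_from_pandas
>>> conn = ConnectionContext(address='<hana-host>', port=443, user='<user>', password='<password>')
>>> data = pd.read_csv('recruiting_synthetic.csv')        # the 10,000 generated instances (placeholder file name)
>>> train_pdf = data.sample(frac=0.8, random_state=2024)  # 80% for training
>>> test_pdf = data.drop(train_pdf.index)                 # remaining 20% for testing
>>> df_train = create_dataframe_from_pandas(conn, train_pdf, table_name='RECRUIT_TRAIN', force=True)
>>> df_test = create_dataframe_from_pandas(conn, test_pdf, table_name='RECRUIT_TEST', force=True)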

Data Exploration, Correlations and Selection Rate

First, let us explore what the training dataframe df_train looks like.

Fig. 1. The first 5 rows of df_train

Then, evaluate the correlation between features and the protected attributes (e.g. by correlating the numerical columns) in order to avoid using strongly intercorrelated indicators.
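As one possible way to produce a correlation plot like the one in Fig. 2, the hana-ml EDA visualizer can be used (a sketch; the figure size and color map are arbitrary choices):

>>> import matplotlib.pyplot as plt
>>> from hana_ml.visualizers.eda import EDAVisualizer
>>> fig, ax = plt.subplots(figsize=(12, 10))
>>> eda = EDAVisualizer(ax)
>>> eda.correlation_plot(data=df_train.deselect('ID'), cmap='coolwarm')  # correlate the numerical columns, including the sensitive attributes
>>> plt.show()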

Fig. 2. Correlation plot

Although all 11 variables are related to the two sensitive variables in the data generation process, the values in the first two rows reveal that none of these variables has more than a moderate correlation (correlation coefficient > 0.4) with the two sensitive variables.

In addition, it is important to note that in the real world, income is often reported and considered to differ across genders ("gender pay gap"). In the data generation process, we introduced an 8% difference in income between the male and female groups. Therefore, in this case, we treat 'income' as a proxy attribute and exclude it from the subsequent model training; this illustrates the practice of explicitly excluding known proxy attributes as well.

Furthermore, it should be acknowledged that in real-world scenarios, if the relationship between proxy attributes and sensitive attributes is covert and cannot be identified and removed during data preprocessing, bias in the data may still be present in the model.

When discussing fairness in AI systems, the first step is to understand the potential harms that the system may cause and how they can be identified. In the case of recruitment, our focus is on whether different groups of people have similar selection rates. For example, if there is a significant difference in selection rates for different genders or ethical groups, the model may cause harm by discriminating against the affected individuals, effectively recommending against their employment.

To illustrate this, the figures below display the selection rates of the groups defined by the sensitive variables 'gender' and 'ethical_group'. The selection rate refers to the proportion of individuals who received a job offer out of the total number of individuals in each group.
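The selection rates per group can be verified directly from the training data, since the mean of the binary target within a group equals the proportion of positive outcomes (a small illustrative check):

>>> pdf = df_train.select('gender', 'ethical_group', 'employed_yes').collect()
>>> print(pdf.groupby('gender')['employed_yes'].mean())         # selection rate per gender group
>>> print(pdf.groupby('ethical_group')['employed_yes'].mean())  # selection rate per ethical group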

Fig. 3. Selection rate of different groups

The figure on the left reveals the selection rate for job offers between the two gender groups - (1: male) and (0: female). In the training data, a significant disparity exists between the two groups, with men receiving job offers at a rate of 43.2%, while women are offered jobs at a rate of only 27.4%.

Similarly, the figure on the right indicates a noticeable disparity between the selection rates of the two ethical groups (0: ethical group X, 1: ethical group Y). Ethical group Y has a selection rate of 47.3%, approximately twice as high as that of ethical group X, which has a selection rate of 23.6%.

Building Machine Learning Classification Models

Training a "Regular" PAL HGBT Model

First, we will train a fairness-unaware model - a 'regular' HGBT (Hybrid Gradient Boosting Tree) model - using cross-validation with hyperparameter grid search, and observe its performance. The training set here is a dataframe named 'hdf_train_fair', from which the sensitive variables 'gender' and 'ethical_group' as well as the proxy attribute 'income' have been excluded (a possible way to derive it is sketched below).
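The blog does not show how hdf_train_fair is constructed; a minimal sketch, using the variable names that reappear in the later snippets, could look as follows:

>>> sen_vars = ['gender', 'ethical_group']   # sensitive attributes
>>> proxy_vars = ['income']                  # proxy attribute identified above
>>> hdf_train_fair = df_train.deselect(sen_vars + proxy_vars)
>>> hdf_test_fair = df_test.deselect(sen_vars + proxy_vars)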

>>> from hana_ml.algorithms.pal.unified_classification import UnifiedClassification
>>> uhgc = UnifiedClassification(func='HybridGradientBoostingTree',
                                 random_state=2024,
                                 fold_num=5,
                                 resampling_method='cv',
                                 param_search_strategy='grid',
                                 evaluation_metric='error_rate',
                                 ref_metric=['auc'],
                                 param_values={'learning_rate': [0.01, 0.1, 0.4, 0.7],
                                               'n_estimators': [30, 50, 70],
                                               'max_depth': [5, 10, 15],
                                               'split_threshold': [0.1, 0.4, 0.7]})
>>> uhgc.fit(data=hdf_train_fair,  # in hdf_train_fair, sensitive variables and the proxy attribute have been excluded
             key='ID',
             label='employed_yes',
             partition_method='stratified',
             partition_random_state=1,
             stratified_column='employed_yes',
             ntiles=20,
             build_report=True)

Then, we capture the optimal parameters (as shown in Fig. 4) from the HGBT hyperparameter search and reuse them in the FairML method, as sketched below.
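A sketch of how the selected values could be read back after the search, assuming the fitted object exposes them via the optimal_param_ attribute; the numbers assigned below are placeholders to be replaced with the values reported in Fig. 4:

>>> print(uhgc.optimal_param_.collect())       # inspect the selected hyperparameters
>>> N_ESTIMATORS, MAX_DEPTH = 50, 10           # placeholders - use the values from Fig. 4
>>> LEARNING_RATE, SPLIT_THRESHOLD = 0.1, 0.4  # placeholders - use the values from Fig. 4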

Fig. 4. The optimal parameters

The next step is to evaluate model generalization using the test data samples (data not used during training) and the score-function.

# Test model generalization using hdf_test_fair (data not used during training) and the SCORE function.
# In hdf_test_fair, sensitive variables and the proxy attribute have been excluded.
>>> scorepredictions, scorestats, scorecm, scoremetrics = uhgc.score(data=hdf_test_fair, key='ID',
                                                                     label='employed_yes', ntiles=20,
                                                                     impute=True, thread_ratio=1.0,
                                                                     top_k_attributions=20)

Debrief the regular HGBT model

We will delve into the details of the regular HGBT model through a model report and a SHAP report.
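Since build_report=True was set during fit(), one way to render the model report shown in Fig. 5 is via the UnifiedReport visualizer (a sketch):

>>> from hana_ml.visualizers.unified_report import UnifiedReport
>>> UnifiedReport(uhgc).build().display()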

Fig. 5. Model report

In the model report above, the model's various performance metrics from training and scoring can be observed. For instance, on the 'Scoring Statistic' page shown in Fig. 5, the model shows high values for metrics such as AUC (Area Under the Curve) and Accuracy, high precision and recall for class 0, and relatively good precision and recall for class 1. This indicates that the trained HGBT model performs reasonably well.

Fig. 6. SHAP report

In this SHAP summary report shown in Fig.6, the contribution of different features to the prediction can be observed in the beeswarm plot. The top three variables that have the most significant impact on the prediction are 'years_experience', 'referred', and 'gcse'. Taking 'years_experience' as an example, red signifies a longer work experience while blue indicates a shorter one. Since the blue falls in the negative value region and the red is distributed in the positive value region, this suggests that the longer the work experience, the more likely one is to receive a job offer. This trend is consistent with the settings during the dataset construction. Similarly, consistent results can be observed in the ‘years_experience’ graph in the dependence plot.

In the following Fig. 7 and Fig. 8, we display and compare various metrics of the score predictions with respect to the sensitive variables.
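An illustrative way to compute such per-group metrics from the score predictions is sketched below; it assumes the predicted label is returned in a column named 'SCORE' and joins the held-out sensitive columns back in via the ID:

>>> pred_pdf = scorepredictions.select('ID', 'SCORE').collect()
>>> true_pdf = df_test.select('ID', 'gender', 'ethical_group', 'employed_yes').collect()
>>> eval_pdf = true_pdf.merge(pred_pdf, on='ID')
>>> eval_pdf['SCORE'] = eval_pdf['SCORE'].astype(int)                    # predicted label
>>> eval_pdf['correct'] = eval_pdf['SCORE'] == eval_pdf['employed_yes']  # per-row correctness
>>> print(eval_pdf.groupby('gender')[['SCORE', 'correct']].mean())         # selection rate and accuracy per gender
>>> print(eval_pdf.groupby('ethical_group')[['SCORE', 'correct']].mean())  # selection rate and accuracy per ethical group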

Fig. 7. Metrics in an unmitigated HGBT model for gender

Fig. 8. Metrics in an unmitigated HGBT model for ethical group

In the two figures showcased, the model's accuracy for both categories is high (>80%). A substantial difference in the selection rate exists between males and females as well as between the ethical groups, with men and ethical group Y showing a higher selection rate. This disparity is anticipated, as such patterns have already been observed in the dataset.

Training a PAL FairML-HGBT Model

In this section, we employ FairML to mitigate the bias observed in the data and in the regular HGBT model, in order to counter the perpetuation or amplification of the existing demographic disparity in selection rates. For binary classification, several constraint options are available: "demographic_parity", "equalized_odds", "true_positive_rate_parity", "false_positive_rate_parity", and "error_rate_parity". For regression tasks, the option "bounded_group_loss" is supported. The choice of constraint depends on the inherent bias of the problem.
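For reference, the "demographic_parity" constraint used below requires the rate of positive predictions to be (approximately) equal across the groups defined by a sensitive attribute A:

P(Y_hat = 1 | A = a) = P(Y_hat = 1 | A = b)   for all groups a and b,

whereas "equalized_odds" requires this equality to hold conditionally on the true label Y, i.e. separately for Y = 0 and Y = 1.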

In the recruiting case, we aim to achieve similar selection rates across the demographic groups. To reach this goal, we set the 'fair_constraint' parameter to 'demographic_parity'. Of particular interest is the parameter 'fair_exclude_sensitive_variable', which indicates that while the sensitive attribute(s) are used to control the fairness condition during training, they are not part of the actual model and are thus not needed when the model is applied for predictions. Note that the proxy attribute 'income' is excluded from both the training set and the test set.

>>> from hana_ml.algorithms.pal.fair_ml import FairMLClassification
>>> hgbt_params = dict(n_estimators=int(N_ESTIMATORS), split_threshold=SPLIT_THRESHOLD,
                       learning_rate=LEARNING_RATE, max_depth=int(MAX_DEPTH), random_state=1234)
>>> hgbt_fm = FairMLClassification(fair_submodel='hgbt',
                                   fair_constraint="demographic_parity",
                                   fair_exclude_sensitive_variable=True,  # exclude sensitive attributes from the model
                                   **hgbt_params)
>>> hgbt_fm.fit(data=df_train.deselect(proxy_vars),
                key='ID', label='employed_yes',
                fair_sensitive_variable=sen_vars)  # indicate which attributes are sensitive; the fairness condition is built upon them

Use the new FairML model to make predictions and obtain the results.

>>> print(excluded_features)
>>> result_fml = hgbt_fm.predict(df_test.deselect(excluded_features+["employed_yes"]), key='ID')
>>> result_fml.head(2).collect()

Fig. 9. FairML predict result

Debrief the FairML HGBT model

The following plots show the metrics of the FairML-HGBT model for different demographic groups.

Fig. 10. Metrics in a mitigated HGBT model for gender

Fig. 11. Metrics in a mitigated HGBT model for ethical group

Comparing both models in Selection Rate and Accuracy

In this section, we compare the performance of the two models - the regular HGBT model and the FairML-HGBT model - in terms of accuracy and selection rate on the test dataset. This comparison provides a more intuitive understanding of how the FairML-HGBT model addresses fairness concerns while maintaining model performance.
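The comparison can be reproduced with a few lines of pandas on the collected predictions; the sketch below assumes both prediction results carry the predicted label in a 'SCORE' column and reuses the test data held in df_test:

>>> cmp_pdf = df_test.select('ID', 'gender', 'ethical_group', 'employed_yes').collect()
>>> cmp_pdf = cmp_pdf.merge(scorepredictions.select('ID', 'SCORE').collect(), on='ID')
>>> cmp_pdf = cmp_pdf.merge(result_fml.select('ID', 'SCORE').collect(), on='ID', suffixes=('_regular', '_fairml'))
>>> cmp_pdf['sel_regular'] = cmp_pdf['SCORE_regular'].astype(int)                  # regular model: predicted label
>>> cmp_pdf['sel_fairml'] = cmp_pdf['SCORE_fairml'].astype(int)                    # FairML model: predicted label
>>> cmp_pdf['acc_regular'] = cmp_pdf['sel_regular'] == cmp_pdf['employed_yes']     # regular model: correctness
>>> cmp_pdf['acc_fairml'] = cmp_pdf['sel_fairml'] == cmp_pdf['employed_yes']       # FairML model: correctness
>>> print(cmp_pdf.groupby('gender')[['sel_regular', 'sel_fairml', 'acc_regular', 'acc_fairml']].mean())
>>> print(cmp_pdf.groupby('ethical_group')[['sel_regular', 'sel_fairml', 'acc_regular', 'acc_fairml']].mean())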

Fig. 12. Metrics comparison for gender

Fig. 13. Metrics comparison for ethical group

By using the FairML approach, we can clearly observe a reduction in the disparity of selection rates between the groups. For example, the selection rate for men has decreased, while the selection rate for women has increased. As expected, the FairML-HGBT model shows a slight decrease in accuracy compared to the regular HGBT model.
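As a single disparity summary, the demographic parity difference (the gap between the largest and smallest per-group selection rate) can be computed for both models, reusing cmp_pdf from the sketch above:

>>> rates_reg = cmp_pdf.groupby('gender')['sel_regular'].mean()
>>> rates_fair = cmp_pdf.groupby('gender')['sel_fairml'].mean()
>>> print('demographic parity difference (gender), regular:', rates_reg.max() - rates_reg.min())
>>> print('demographic parity difference (gender), FairML: ', rates_fair.max() - rates_fair.min())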

Conclusion and Disclaimer

In this blog, we introduced the FairML function newly provided with SAP HANA Cloud PAL in 2023 QRC04. Using a synthetic recruiting dataset, we demonstrated how the SAP HANA FairML approach can be used to mitigate bias in a machine learning model, thereby promoting a more equitable AI environment.

It is essential to differentiate the notion of unfairness in machine learning from discrimination under laws and regulations. Unfairness in machine learning can contribute to discriminatory outcomes; while such outcomes of an AI system might not necessarily constitute illegal discrimination under applicable laws and regulations, from an ethical and responsible AI perspective (or company policy) they might still be regarded as unacceptable. Therefore, caution should be exercised, and compliance with applicable laws and regulations is crucial, as improper usage may unintentionally result in discriminatory outcomes.

The FairML function assists with data analysis and prediction but has limitations, relying on the quality and completeness of the data inputs. Although efforts have been made to reduce bias and promote fairness, a 100% elimination of bias is not possible. Consequently, the function's outputs may still reflect varying degrees of inequality. Applying any mitigation function may also introduce new negative effects on other minority groups or individuals. It is important not to rely solely on this tool for critical decision-making or as the sole determinant of conclusions.

Users are responsible for verifying results and ensuring that decisions based on the function's output comply with local laws and regulations. Caution should be exercised when using the function, which should supplement, rather than replace, human judgment, particularly in cases involving personal, legal, or ethical implications. The creators or providers of this function do not assume liability for misunderstandings, misapplications, or misinterpretations resulting from its usage, nor for any harm, damage, or losses incurred. In conclusion, we advise caution and skepticism when interpreting results and verifying information in critical matters.

Other Useful Links:

The use case example file can be found at github.com/SAP-samples/hana-ml-samples/.../FairML - Fair Recruiting Model.ipynb.

Install the Python Machine Learning client for SAP HANA from the public PyPI repository: hana-ml

We also provide an R API for SAP HANA PAL called hana.ml.r; please refer to the documentation for more information.

For other blog posts on hana-ml:

  1. A Multivariate Time Series Modeling and Forecasting Guide with Python Machine Learning Client for SA...
  2. Outlier Detection using Statistical Tests in Python Machine Learning Client for SAP HANA
  3. Outlier Detection by Clustering using Python Machine Learning Client for SAP HANA
  4. Anomaly Detection in Time-Series using Seasonal Decomposition in Python Machine Learning Client for ...
  5. Outlier Detection with One-class Classification using Python Machine Learning Client for SAP HANA
  6. Learning from Labeled Anomalies for Efficient Anomaly Detection using Python Machine Learning Client...
  7. Python Machine Learning Client for SAP HANA
  8. Import multiple excel files into a single SAP HANA table
  9. COPD study, explanation and interpretability with Python machine learning client for SAP HANA
  10. Model Storage with Python Machine Learning Client for SAP HANA
  11. Identification of Seasonality in Time Series with Python Machine Learning Client for SAP HANA

 

1 Comment
Sergiu
Contributor

Hi xinchen.

I am glad to see the implementation of fairness in SAP HANA Machine Learning.

In my blog post, SAP HANA Cloud Machine Learning Challenge “I quit!” – Understanding metrics, I also explained the importance of fairness and how to interpret accuracy, precision, and recall when the support of classes is imbalanced.

Regards,

Sergiu Iatco.