Feature Selection Collection

In industry, we often need to generate features, understand them, and then generate more to improve model performance. I'm taking notes on some methods that may help with further exploration.

SHAP

Debugging Tips

  • When importing shap, if you get an error saying "no numba.core"
    • Try running pip install numba==0.54.0 --user
  • SHAP errors when the model problem is binary classification
    • Choose shap_values[1] as the shap_values input (see the snippet after this list)
  • The inputs of beeswarm and summary_plot are easy to confuse
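A minimal sketch of the shap_values[1] fix above, assuming a scikit-learn binary classifier (whether TreeExplainer returns one SHAP array per class depends on the model type and SHAP version):

import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Some binary classifiers return a list with one array per class;
# index [1] picks the SHAP values of the positive class.
if isinstance(shap_values, list):
    shap_values = shap_values[1]

shap.summary_plot(shap_values, X)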

Other Tips

  • How to calculate Shapley values step by step: https://www.youtube.com/watch?v=fbrVvMU8T6o

    • It originated from game theory. The basic idea: several players share an ice cream, and the Shapley value calculates, if one of the players were removed, how much of the share each remaining player would get (see the toy computation below)
  • It's a method used to deal with the drawbacks of XGBoost feature selection

  • You know what, SHAP values come from game theory

    • "Shapley values correspond to the contribution of each feature towards pushing the prediction away from the expected value."
  • My practice code 1 - XGBoost Regressor

  • GitHub has blocked the loading of JS; in fact, shap JS provides a way to interact with each record and understand how the feature values affect the prediction

    • In this force plot, the "base value" is the estimated average predicted value from model training (in newer SHAP versions it's computed using leaf nodes and no longer equals the average of forecasted values...), and the "output value" is the predicted value of the current observation. Pink is the positive impact (dragging the prediction value higher or towards 1) while blue is the negative impact (dragging the prediction value lower or towards 0).
    • The length of the bar for each feature indicates to what extent the feature affects the forecasted value
    • output value = base value + sum(all features' SHAP values), as checked in the snippet below this list
      • Because of this, sometimes when you get negative forecast values, you can shift the output value to the right and split the shifted difference across the features (better to shift proportionally to each feature's absolute SHAP value, so that the original feature importance is kept as much as possible). By doing this the base value stays the same, each feature's impact visually stays almost the same, and the forecasted value has been "corrected".
        • BTW, here's an explanation of why you may get negative forecast values from a boosting regressor even when the training target values are all positive
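A minimal sketch of that additivity check, assuming a trained XGBoost regressor model and a test frame X_test (both hypothetical names):

import numpy as np
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# output value = base value + sum of per-feature SHAP values, record by record
reconstructed = explainer.expected_value + shap_values.sum(axis=1)
print(np.allclose(reconstructed, model.predict(X_test), atol=1e-4))

For classifiers, the same sum reconstructs the model's log-odds margin rather than the probability.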
  • SHAP decision plot

    • Just needs the trained model and feature values; no labels needed in the data
    • Adding link='logit' in the decision_plot() call will convert log odds back to prediction probabilities (SHAP values are log odds); see the snippet after this list
    • SHAP decision plot source code
      • In the decision plot, by default the features are ordered by importance. When there are multiple records, the importance of a feature is the sum of the absolute SHAP values of that feature
    • It can include multiple records in one plot
    • For each record, it shows how each feature value leads to the final prediction result
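A minimal sketch of a multi-record decision plot with link='logit', assuming a trained binary classifier model and a test frame X_test (hypothetical names):

import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
expected_value = explainer.expected_value
if isinstance(shap_values, list):  # some classifiers return one array per class
    shap_values = shap_values[1]
    expected_value = expected_value[1]

# Plot the first 20 records; link='logit' labels the x-axis as probabilities
shap.decision_plot(expected_value, shap_values[:20], X_test.iloc[:20], link='logit')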
  • More Advanced SHAP Insights

  • For SHAP with binary classification, by default the generated SHAP values are log odds. Adding data, model_output, and feature_dependence as below converts the SHAP values back to probabilities

explainer = shap.TreeExplainer(enc_model, data=encoded_X_test,
                               model_output='probability',
                               feature_dependence='independent')
# NOTE: newer SHAP versions renamed this argument to feature_perturbation='interventional'
  • The base value generated from TreeExplainer's expected_value can differ from the average forecasted result of the model's predict(), since TreeExplainer depends on some settings from the training data, such as leaf sample weights for random subsampling
    • With SHAP version <= 0.33, you can pass X_train to SHAP instead of using the trained model
    • Otherwise, in newer versions of SHAP the base value is no longer the "average forecasted value"; it's calculated using leaf nodes
  • When the data is large, you can use clustered_df = shap.kmeans(df, k) and pass this clustered_df to a SHAP explainer as the background data. This method helps speed up the computation (see the sketch after this list)
    • In SHAP, for each feature subset (2^m - 2 of them) it perturbs the values of the features and makes predictions to see how perturbing that subset changes the model's prediction. For each feature subset (e.g. [0,1,1,0,0,0], only perturbing the 2nd and 3rd features) it can replace the feature values by any of the values in the training set. By default it does that exhaustively for all points in training, therefore the total number of model predictions it evaluates is on the order of N * 2^m. So we use shap.kmeans to only perturb based on some representatives (10 centroids instead of 1000 data points)
      • m is the number of features
      • N is the number of samples
      • The total number of model evaluations is N * (2^m - 2)
      • Although the perturbed values are assigned randomly, they are chosen from the values of that feature that appeared in the training data
    • There are different types of SHAP explainers
      • KernelExplainer is generic and can be used for all types of models, but it is slow
        • That's also why, when using TreeExplainer, you don't have to use shap.kmeans for large datasets, since it's fast
        • KernelExplainer is not practical for more than ~15 features
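A minimal sketch of speeding up KernelExplainer with shap.kmeans, assuming a fitted model and frames X_train and X_test (hypothetical names):

import shap

# Summarize the background data with 10 k-means centroids instead of
# all training rows, which caps the number of perturbation evaluations
background = shap.kmeans(X_train, 10)

explainer = shap.KernelExplainer(model.predict, background)
shap_values = explainer.shap_values(X_test.iloc[:50])  # explain a sample of records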
  • When doing experiments on SHAP performance, there are multiple things to check
    • Time efficiency for different numbers of samples, different numbers of features, and different model sizes (such as different numbers of trees)
    • While the time efficiency has been improved, how's the accuracy of the model predictions
  • I don't think SamplingExplainer can replace TreeExplainer for ensemble models
    • It requires a model input from train(), and cannot use a loaded trained model
    • When there are 3000+ samples, TreeExplainer is still faster
  • Display SHAP plots in Databricks Notebooks
    • Set matplotlib=True so that you don't need to initialize JS. But on Databricks this is currently not supported for force plots of multiple records...
import matplotlib.pyplot as plt
import shap

exp_shap = shap.TreeExplainer(model)
shap_tree = exp_shap.shap_values(X_test)
expected_tree = exp_shap.expected_value
if isinstance(expected_tree, list):  # binary classifiers may return one base value per class
  expected_tree = expected_tree[1]
print(f"Explainer expected value (Base Value): {expected_tree}")

idx = 10

print(f'Force Plot for #"{idx}" observation in test dataframe:')
## Option 1 - hide feature values
shap_force_plot = shap.force_plot(expected_tree, shap_tree[idx], feature_names=list(X_test.columns), matplotlib=True)
## Option 2 - show feature values
shap_force_plot = shap.force_plot(expected_tree, shap_tree[idx], X_test.iloc[idx], matplotlib=True)
display(shap_force_plot)

print(f'Decision Plot for #"{idx}" observation:')
print('Base Value:', expected_tree)
## Option 1 - show feature values
shap_deci_plot = shap.decision_plot(expected_tree, shap_tree[idx], X_test.iloc[idx])
## Option 2 - hide feature values
shap_deci_plot = shap.decision_plot(expected_tree, shap_tree[idx], feature_names=list(X_test.columns))
display(shap_deci_plot)
  • Applying SHAP to binary classifiers like LightGBM and XGBoost: these classifiers are built on the log-odds scale and then transformed to probabilities for predict_proba, so the SHAP values are also in log-odds units. A negative base value means class 0 is more likely than class 1, and the sum of the SHAP values equals the log-odds output of the model, not the probability after the logistic function. If you need your SHAP output on the probability scale, code as below, making sure you have data specified in TreeExplainer plus model_output='probability' and feature_dependence='independent' (complete example code):
explainer_real = shap.TreeExplainer(model_real, data=encoded_X_train, 
                                    model_output='probability', feature_dependence='independent')
shap_values_real = explainer_real(encoded_X_test)
expected_tree_real = explainer_real.expected_value
if isinstance(expected_tree_real, list):
    expected_tree_real = expected_tree_real[1]
print(f"Explainer expected value (Base Value): {expected_tree_real}")
  • NOTE: the above method may not work for CatBoost (as of shap==0.43.0 and catboost==1.2.2 this is still a bug); however, you can use np.exp(shap_value)/(1 + np.exp(shap_value)) to convert SHAP's log-odds output to probability format (this only needs shap.TreeExplainer(model)); see the sketch below.
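A minimal sketch of that conversion, assuming a trained CatBoost binary classifier cb_model (hypothetical name); one common reading is to apply the sigmoid to the total log-odds output (base value plus SHAP sum) per record:

import numpy as np
import shap

explainer = shap.TreeExplainer(cb_model)
shap_values = explainer.shap_values(encoded_X_test)  # log-odds units
log_odds = explainer.expected_value + shap_values.sum(axis=1)
probabilities = np.exp(log_odds) / (1 + np.exp(log_odds))  # sigmoid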