Understanding Shapley Values in R
Shapley Values can be a really useful tool in data analysis and machine learning. Recently, I came across them and found them interesting. Here is an introduction and a step-by-step guide on how to use them in R.
In the world of machine learning and explainable AI (XAI), understanding model predictions is just as crucial as making accurate forecasts.
One powerful method for feature attribution is Shapley Values, which originates from cooperative game theory.
Shapley Values help us determine how much each feature contributes to a model's prediction, making them widely used in finance, healthcare, and actuarial science.
In this post, I’ll explore how to compute and interpret Shapley Values in R using the iml package.
What Are Shapley Values?
Shapley Values, introduced by Lloyd Shapley in 1953, provide a fair way to distribute an outcome among multiple contributors.
In machine learning, they split the difference between a single prediction and the average prediction among the features, by averaging each feature's marginal contribution across all possible feature combinations.
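The averaging described above can be written down compactly. The notation here is my own choice rather than anything from the iml package: p is the number of features, S is a subset of features that does not contain feature j, and v(S) is the expected prediction when only the features in S are known.

```latex
\phi_j \;=\; \sum_{S \subseteq \{1,\dots,p\} \setminus \{j\}}
  \frac{|S|!\,\bigl(p - |S| - 1\bigr)!}{p!}\,
  \Bigl( v\bigl(S \cup \{j\}\bigr) - v(S) \Bigr),
\qquad
\sum_{j=1}^{p} \phi_j \;=\; \hat{f}(x) - \mathbb{E}\bigl[\hat{f}(X)\bigr]
```

The second identity is the "fair distribution" property: the contributions of all features add up exactly to the gap between this prediction and the average prediction.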
Shapley Values provide a theoretically sound way to interpret model decisions, making them particularly useful for:
Feature importance analysis: Understanding which features drive predictions.
Fairness: Detecting bias in decision-making models.
Regulatory compliance: Meeting explainability requirements in fields like finance and healthcare.
Unlike simpler feature importance methods (e.g., correlations or coefficients in linear models), Shapley Values account for feature interactions, which often makes their explanations more reliable for non-linear models.
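To see how an interaction gets shared out, here is a minimal base R sketch that computes exact Shapley Values for a tiny made-up game. Everything in it is hypothetical: the three players a, b and c stand in for features, and the value function v, the helpers all_subsets and shapley_value, and the payoff numbers are invented for illustration. It is not how iml computes its estimates.

```r
# Exact Shapley values for a toy cooperative game with three players.
players <- c("a", "b", "c")

# Hypothetical payoff of a coalition: "a" is worth 10, "b" is worth 20,
# and "a" and "b" together earn an interaction bonus of 6. "c" adds nothing.
v <- function(coalition) {
  10 * ("a" %in% coalition) +
    20 * ("b" %in% coalition) +
    6 * all(c("a", "b") %in% coalition)
}

# List every subset of a character vector, including the empty set.
all_subsets <- function(x) {
  if (length(x) == 0) return(list(character(0)))
  masks <- expand.grid(rep(list(c(FALSE, TRUE)), length(x)))
  lapply(seq_len(nrow(masks)), function(i) x[unlist(masks[i, ])])
}

# Weighted average of the player's marginal contribution over all
# coalitions S of the other players.
shapley_value <- function(player, players, v) {
  others <- setdiff(players, player)
  p <- length(players)
  contributions <- vapply(all_subsets(others), function(S) {
    weight <- factorial(length(S)) * factorial(p - length(S) - 1) / factorial(p)
    weight * (v(c(S, player)) - v(S))
  }, numeric(1))
  sum(contributions)
}

phi <- vapply(players, shapley_value, numeric(1), players = players, v = v)
print(phi)  # a = 13, b = 23, c = 0
```

The bonus of 6 is split evenly between a and b (3 each), c gets nothing, and the three values add up to the payoff of the full coalition, 36. Enumerating every subset like this only works for a handful of players, which is why packages rely on sampling instead.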
Implementing Shapley Values in R
Let’s look at how to compute and interpret Shapley Values in R by building a simple machine learning model.
Step 1: Install and Load Required Packages
To compute Shapley Values, we use the iml package, which provides tools for model interpretation in R.
Install the packages if needed, then load them:
install.packages("iml")
install.packages("randomForest")
install.packages("MASS")
library(iml)
library(randomForest)
library(MASS)
Step 2: Train a Model
We'll use the Boston housing dataset from the MASS package, which contains the median house price (medv) for Boston-area neighborhoods along with various features we can use to predict it.
To see what this dataset looks like, type ‘Boston’ in your R console and run it.
Boston
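Printing the whole data frame is a lot to read, so a more compact look may help. This is a small sketch using only base R functions on the same dataset; it also confirms where the target column medv sits, which matters when the features are separated from the target later on.

```r
library(MASS)

# Dimensions: number of rows (neighborhoods) and columns (features + target).
print(dim(Boston))

# Column names and types, with the first few values of each.
str(Boston)

# First rows only, instead of the full table.
print(head(Boston))

# Position of the target column medv among the columns.
print(which(names(Boston) == "medv"))
```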
Set the seed so that the results are reproducible. A random forest is built with randomness, so without a fixed seed the model will differ from run to run and you won’t be able to recreate your results exactly.
Our model is a random forest with 100 trees (the ntree parameter).
set.seed(42)
rf_model <- randomForest(medv ~ ., data = Boston, ntree = 100)
Step 3: Compute Shapley Values
Now, create an iml::Predictor object and calculate Shapley Values for a sample observation.
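# Column 14 is the target (medv), so it is dropped from the feature data
predictor <- Predictor$new(rf_model, data = Boston[, -14], y = Boston$medv)
# Explain the prediction for the first row of the dataset
observation <- Boston[1, -14]
shapley <- Shapley$new(predictor, x.interest = observation)
print(shapley)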
The output shows the predicted value for this observation (the median house price) and the average prediction over the dataset.
At the bottom, a table lists the values of phi, which tell us how much each feature's value pushed this prediction away from the average.
A positive phi raises the prediction and a negative phi lowers it; the larger the absolute value, the bigger the impact on the final prediction.
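The phi table can also be pulled out of the object and inspected directly. The sketch below is self-contained, so it refits the same model. It assumes the results, y.hat.interest and y.hat.average fields and the sample.size argument behave as described in the iml documentation; the value 500 is an arbitrary choice of mine.

```r
library(iml)
library(randomForest)
library(MASS)

set.seed(42)
rf_model <- randomForest(medv ~ ., data = Boston, ntree = 100)
predictor <- Predictor$new(rf_model, data = Boston[, -14], y = Boston$medv)

# iml estimates phi by Monte Carlo sampling. sample.size sets the number of
# samples (default 100); more samples give steadier estimates but run slower.
shapley <- Shapley$new(predictor, x.interest = Boston[1, -14], sample.size = 500)

# One row per feature, with columns feature, phi, phi.var and feature.value.
res <- shapley$results

# Rank the features by the size of their contribution, ignoring the sign.
print(res[order(-abs(res$phi)), c("feature", "phi", "feature.value")])

# The contributions should add up roughly to the gap between this
# prediction and the average prediction.
print(sum(res$phi))
print(unname(shapley$y.hat.interest - shapley$y.hat.average))
```

Because the values are sampled estimates, the two numbers at the end should be close but will not necessarily match exactly, and the ranking of features with small phi can change between runs with different seeds.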
Step 4: Visualize Shapley Values
A plot makes the feature contributions (the values of phi) easier to interpret.
plot(shapley)
Based solely on these Shapley Values, the lstat and indus variables have the biggest impact on the predicted median house price for this observation, while the chas and zn variables have the lowest impact.
In a real project, we could cross-check this with a second feature importance method before selecting the best features.
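One candidate for such a second method is permutation feature importance, which iml also provides through its FeatureImp class. This is a sketch of how that cross-check might look, not a recommendation of a specific workflow. It refits the same model so that it runs on its own, and the choice of mean squared error as the loss is mine.

```r
library(iml)
library(randomForest)
library(MASS)

set.seed(42)
rf_model <- randomForest(medv ~ ., data = Boston, ntree = 100)
predictor <- Predictor$new(rf_model, data = Boston[, -14], y = Boston$medv)

# Permutation importance: how much the model error grows when the values
# of one feature are shuffled, measured here with mean squared error.
importance <- FeatureImp$new(predictor, loss = "mse")

# One row per feature, sorted from most to least important.
print(importance$results[, c("feature", "importance")])
plot(importance)
```

Keep in mind that the two methods answer different questions: the Shapley Values above explain one prediction, while permutation importance summarizes the model over the whole dataset, so the rankings do not have to agree.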
Conclusion
Shapley Values provide an effective way to explain complex models, offering insights into feature importance while considering feature interactions.
Using the iml package in R, we can compute and visualize Shapley Values to better understand model decisions.
By integrating these methods into your workflow, you can build more trustworthy, transparent, and fair machine learning models.
Try experimenting with different datasets and models to see how feature contributions vary!
If you want to learn more about Shapley Values, I've included a couple of links below for reference. Enjoy!
shapley package - RDocumentation - https://www.rdocumentation.org/packages/shapley/versions/0.1
A gentle introduction to SHAP values in R - https://www.r-bloggers.com/2019/03/a-gentle-introduction-to-shap-values-in-r/