Elian Angius

ML Debuggability

Instruments tree-based models with learn-time metadata & run-time signals to diagnose why any prediction went wrong.

  • MLOps
  • Explainability
  • Tree Models

Overview

When a deployed tree-based model (decision tree, random forest, boosted trees, ..) produces a bad prediction, an engineer needs to know what to do about it — ship a code fix, expensive re-train, design new features, or accept it as a domain edge case & move on. Feature vectors, model scores & raw inputs alone don’t say which. This project is a model-debugging toolkit, pioneered before SHAP & perturbation-based explainability techniques became mainstream, which opened up the black box of any tree ensemble so anyone can triage a bad prediction’s root cause.

The Challenge

  • A bad prediction can stem from several distinct root causes, each demanding a different remedy:
    • a programming bug (exception, edge case, faulty implementation)
    • a model blind spot (a missing feature, unsampled training data)
    • a domain outlier (a genuinely hard or ambiguous case)
    • distribution shift (inputs have changed since training)
    • label shift (expectations of “correct” has changed)
  • Feature vectors, probability scores & raw data alone can’t distinguish between these.
  • Had to remain agnostic to model type, hyperparameters, feature space & domain — usable on any tree ensemble, for any use case
  • No existing way to compare what a model saw during training against what it encounters at inference time

Approach

  • At learn time, instrumented the trained forest with metadata:
    • hyperparameters used, number of training examples, & classification breakdown
    • feature importance against the test set, & the fraction of available features actually used
    • per-tree & per-node stats — shortest/longest/average path length, & the distribution of feature values & training-example counts seen at each node
  • At run time, traversed each datapoint through the forest & contrasted its behavior against that learn-time metadata to derive 3 signals:
    • Path length — how deep a datapoint travels relative to the forest’s average, proxying how novel or difficult the case is (an unusually shallow path can signal incomplete learning; an unusually deep one can signal an outlier or an overfit model)
    • Result confidence — cumulative tree-vote confidence rather than a simple majority vote, flagging predictions likely to flip-flop on retraining
    • Feature frequency — how often each feature fires along a datapoint’s paths, normalized by how often that feature appears in the forest overall, surfacing which features actually drove (or failed to drive) the decision
  • Packaged these signals into a debug object attached to every prediction & surfaced them in an in-house review UI

Results

  • Let engineers triage a bad prediction into a code fix, feature engineering, a retrain, or an accepted domain edge case — without re-running experiments
  • Independently arrived at ideas later popularized by SHAP & perturbation-based explainability methods
  • The review UI doubled as a labeling tool, feeding flagged examples back as labeled data for the next training generation
  • Model & domain agnostic — the same framework applies wherever a tree ensemble produces a classification