What Is Explainable AI?
The Interpretability Problem
A linear regression model with five features is fully transparent. You can read the coefficients and understand exactly how much each feature contributes to the output. A decision tree with 20 nodes can be visualized as a flowchart and followed step by step. These models are interpretable by design, but their simplicity limits their accuracy on complex tasks. A deep neural network with 175 billion parameters can achieve remarkable accuracy on tasks from medical diagnosis to language generation, but no human can comprehend how 175 billion interacting weights produce a specific output. The internal computation involves millions of matrix multiplications, nonlinear activations, and attention operations that collectively implement a function too complex for any person to trace.
This opacity matters whenever AI decisions have consequences. If a model denies a mortgage application in the United States, the Equal Credit Opportunity Act entitles the applicant to a statement of the specific reasons for the denial. "The neural network's internal activations produced a score below the threshold" is not a legally sufficient explanation. If a model recommends a medical treatment, the physician needs to understand the reasoning to assess whether it is appropriate for this specific patient. If a model flags a person as a security risk, oversight bodies need to evaluate whether the flagging criteria are reasonable and non-discriminatory. In each case, the model's raw computation must be translated into human-understandable terms.
The accuracy-interpretability tradeoff is often presented as fundamental, but it is more nuanced than a simple dichotomy suggests. For many practical problems, inherently interpretable models like logistic regression, decision trees, and rule-based systems achieve accuracy within a few percentage points of black-box models. In a 2019 paper, Cynthia Rudin argued that in many high-stakes domains, including criminal justice and healthcare, interpretable models match or approach the accuracy of complex models. The accuracy gap that justifies black-box complexity is often smaller than commonly assumed, and the additional accuracy may not be worth the loss of interpretability in domains where understanding the reasoning is critical.
Inherently Interpretable Models
Linear models are the simplest form of interpretable AI. A linear regression or logistic regression model represents its decision as a weighted sum of input features. Each coefficient directly quantifies the feature's influence on the output: a positive coefficient for income in a credit model means higher income increases the approval probability, and the magnitude of the coefficient tells you by how much. In linear regression the coefficient is the change in the output per unit change in the feature. In logistic regression it is the change in the log-odds, so the effect on the probability itself depends on where the applicant already sits on the sigmoid curve. These models are fully transparent but limited to linear relationships, which means they cannot capture the complex interactions and nonlinearities present in most real-world data without extensive feature engineering.
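To make the weighted-sum reading concrete, here is a minimal sketch of a logistic credit model. The feature names, the coefficients, and the applicant are all hypothetical values chosen for illustration, not fitted to any data.

```python
import math

# Hypothetical coefficients for a toy credit model (illustrative, not fitted).
INTERCEPT = -1.0
COEFFICIENTS = {"income_10k": 0.30, "debt_ratio": -2.0, "years_employed": 0.10}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def explain(applicant):
    """Return the approval probability and each feature's additive
    contribution to the log-odds (coefficient times feature value)."""
    contributions = {name: COEFFICIENTS[name] * applicant[name] for name in COEFFICIENTS}
    log_odds = INTERCEPT + sum(contributions.values())
    return sigmoid(log_odds), contributions

applicant = {"income_10k": 6.0, "debt_ratio": 0.4, "years_employed": 5.0}
probability, contributions = explain(applicant)

for name, value in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:>15}: {value:+.2f} log-odds")
print(f"approval probability: {probability:.3f}")
```

The explanation is the model itself: the printed contributions are exactly the terms the model adds up, so nothing is approximated.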
Decision trees partition the feature space into regions using a sequence of threshold tests. They can be visualized as flowcharts where each internal node tests a feature against a threshold, each branch represents an outcome of the test, and each leaf node contains a prediction. A shallow decision tree (5 to 10 levels) is easy for a non-expert to follow. The challenge is that individual decision trees tend to overfit, and the ensemble methods that address overfitting (random forests, gradient-boosted trees) sacrifice interpretability. A random forest of 500 trees, each with 30 levels, is no more interpretable than a neural network despite being built from individually interpretable components.
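The flowchart reading of a single tree can be sketched in a few lines. The tree below is hand-written and hypothetical (features, thresholds, and labels are assumptions for illustration). The point is that the prediction comes with the exact sequence of tests that produced it.

```python
# Hypothetical hand-written tree; a trained tree would learn these splits from data.
TREE = {
    "feature": "income", "threshold": 40000,
    "below": {"leaf": "deny"},
    "at_or_above": {
        "feature": "debt_ratio", "threshold": 0.35,
        "below": {"leaf": "approve"},
        "at_or_above": {"leaf": "deny"},
    },
}

def predict_with_path(node, sample):
    """Walk from the root to a leaf, recording every threshold test taken."""
    path = []
    while "leaf" not in node:
        name, threshold = node["feature"], node["threshold"]
        if sample[name] < threshold:
            path.append(f"{name} < {threshold}")
            node = node["below"]
        else:
            path.append(f"{name} >= {threshold}")
            node = node["at_or_above"]
    return node["leaf"], path

decision, path = predict_with_path(TREE, {"income": 52000, "debt_ratio": 0.2})
print(decision, "because", " and ".join(path))
```

With 500 deep trees voting, the same function would return 500 separate paths, which is why the ensemble loses the readability of its components.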
Generalized Additive Models (GAMs) represent a middle ground. They model the output as a sum of smooth functions of individual features, allowing nonlinear relationships while maintaining additivity (each feature's contribution can be examined independently). Explainable Boosting Machines (EBMs), developed by Microsoft Research, learn these smooth functions using gradient boosting and include limited pairwise interaction terms. EBMs achieve accuracy competitive with random forests and gradient-boosted trees on many tabular datasets while remaining interpretable: the contribution of each feature can be plotted as a curve, and the interaction between any pair of features can be visualized as a heatmap.
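The additive structure can be illustrated with a small sketch. The two shape functions and the intercept below are hypothetical and written by hand. A fitted GAM or EBM would learn such curves from data, and this sketch omits pairwise interaction terms.

```python
import math

# Hypothetical shape functions for a toy risk score (assumptions, not learned).
def f_age(age):
    """U-shaped effect: lowest contribution around age 45."""
    return 0.002 * (age - 45.0) ** 2 - 0.5

def f_income(income):
    """Diminishing effect: each tenfold increase in income subtracts 0.8."""
    return -0.8 * math.log10(income / 50_000.0)

INTERCEPT = 0.2
SHAPE_FUNCTIONS = {"age": f_age, "income": f_income}

def gam_score(sample):
    """Return the additive score and each feature's separate contribution."""
    contributions = {name: f(sample[name]) for name, f in SHAPE_FUNCTIONS.items()}
    return INTERCEPT + sum(contributions.values()), contributions

score, contributions = gam_score({"age": 45.0, "income": 50_000.0})

# Because the model is additive, one feature's curve can be tabulated (or
# plotted) on its own, without fixing values for the other features.
age_curve = [(age, round(f_age(age), 3)) for age in range(25, 70, 10)]
print(age_curve)
```

Each shape function plays the role that a single coefficient plays in a linear model, except that the effect is a curve rather than a straight line.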
Post-Hoc Explanation Methods
LIME (Local Interpretable Model-agnostic Explanations), published in 2016, explains individual predictions by fitting a simple, interpretable model in the neighborhood of the prediction being explained. To explain why a classifier labeled an image as a "cat," LIME perturbs the image by masking different regions, observes how the classifier's output changes, and fits a linear model that approximates the classifier's behavior locally. The result is a set of image regions ranked by their importance to the prediction. For tabular data, LIME perturbs individual feature values and identifies which features most influence the prediction for a specific data point. The key insight is that even if the global model is too complex to explain, its behavior in the vicinity of any single prediction can often be approximated by a simple model.
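The perturb, weight, and fit loop can be sketched for tabular data as follows. This is a simplified illustration of the idea rather than the LIME library's implementation: the black-box function, the Gaussian perturbation scheme, the kernel width, and all function names are assumptions made for the example.

```python
import math
import random

def black_box(x):
    """Stand-in for an opaque model: nonlinear in x[0], linear in x[1]."""
    return x[0] ** 2 + 3.0 * x[1]

def solve(a, b):
    """Solve a @ coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - tail) / m[r][r]
    return coef

def local_surrogate(model, instance, n_samples=500, noise=0.1, kernel_width=0.25, seed=0):
    """Fit a proximity-weighted linear model to the black box around `instance`."""
    rng = random.Random(seed)
    dim = len(instance)
    xtwx = [[0.0] * (dim + 1) for _ in range(dim + 1)]
    xtwy = [0.0] * (dim + 1)
    for _ in range(n_samples):
        delta = [rng.gauss(0.0, noise) for _ in range(dim)]    # perturb the input
        y = model([v + d for v, d in zip(instance, delta)])    # query the black box
        weight = math.exp(-sum(d * d for d in delta) / kernel_width ** 2)
        row = [1.0] + delta                                    # intercept + offsets
        for i in range(dim + 1):
            xtwy[i] += weight * row[i] * y
            for j in range(dim + 1):
                xtwx[i][j] += weight * row[i] * row[j]
    coef = solve(xtwx, xtwy)                                   # weighted least squares
    return coef[0], coef[1:]

intercept, slopes = local_surrogate(black_box, [1.0, 2.0])
print("local slopes:", [round(s, 2) for s in slopes])
```

Near the point (1, 2) the fitted slopes should land close to 2 and 3, the local gradient of the black box. Rerunning at a different point gives different slopes for the first feature, which is what "local" means here.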
SHAP (SHapley Additive exPlanations), published in 2017, uses game theory to assign each feature a contribution to the prediction. Shapley values, originally developed in cooperative game theory to fairly distribute payoffs among players, provide a theoretically principled way to attribute the difference between the model's prediction and the average prediction to individual features. For each feature, the Shapley value is the weighted average of that feature's marginal contribution across all possible subsets (coalitions) of the other features. SHAP explanations are locally accurate (the contributions sum to the difference between the prediction and the average prediction) and consistent, theoretical guarantees that LIME does not provide. For tree-based models, TreeSHAP computes exact Shapley values efficiently. For neural networks, DeepSHAP provides fast approximations, and the model-agnostic KernelSHAP provides sampling-based estimates for any model type.
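For a model with only a few features, Shapley values can be computed exactly by enumerating every coalition. The sketch below does that for a toy three-feature model. It is not the SHAP library's code, and it makes one loud simplification: "missing" features are replaced by a single fixed baseline value, whereas SHAP averages over a background dataset. The model and inputs are hypothetical.

```python
from itertools import combinations
from math import factorial

def model(x):
    """Toy model with an interaction between features 0 and 1."""
    return 2.0 * x[0] + x[0] * x[1] + 3.0 * x[2]

def shapley_values(f, x, baseline):
    """Exact Shapley values by enumerating all coalitions (exponential cost)."""
    n = len(x)

    def value(coalition):
        # Features outside the coalition are set to their baseline value
        # (assumption: one fixed baseline instead of a background distribution).
        return f([x[i] if i in coalition else baseline[i] for i in range(n)])

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                members = set(coalition)
                phi[i] += weight * (value(members | {i}) - value(members))
    return phi

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)
print([round(p, 3) for p in phi], "sum =", round(sum(phi), 3))
```

The values come out at approximately 3, 1, and 9: the interaction term is split evenly between the two features involved, and the total equals the prediction minus the baseline prediction (13 minus 0). The enumeration visits every subset, which is why practical tools rely on TreeSHAP or sampling for realistic feature counts.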
Attention visualization reveals which parts of the input a transformer model focused on when generating each part of its output. In a vision transformer classifying a medical image, attention maps show which regions of the image the model attended to most strongly. In a language model, attention patterns show which input tokens received the most weight when each output token was generated. These visualizations are intuitive and visually compelling, but their reliability as explanations is contested. Research has shown that attention weights do not always correlate with feature importance as measured by other methods, and that randomly permuting a model's attention weights sometimes leaves its output nearly unchanged, suggesting that attention visualizations may overstate the causal role of the attended features.
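The quantity that an attention map displays is a row of softmax weights. The following sketch computes scaled dot-product attention weights for one query over three tokens. The tokens and the 2-dimensional key and query vectors are hypothetical, and a real transformer has many heads and layers, each with its own weights.

```python
import math

def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over a list of keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

# Hypothetical 2-dimensional vectors for three input tokens.
tokens = ["the", "patient", "recovered"]
keys = [[0.1, 0.0], [2.0, 1.0], [0.5, 0.5]]
query = [1.0, 1.0]

weights = attention_weights(query, keys)
most_attended = tokens[max(range(len(tokens)), key=lambda i: weights[i])]
for token, weight in zip(tokens, weights):
    print(f"{token:>10}: {weight:.2f}")
```

These weights describe where the model placed its attention, not how much each token changed the output, which is the gap the reliability debate is about.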
Concept-based explanations operate at a higher level of abstraction. Instead of attributing predictions to raw input features (pixels, words), they attribute predictions to human-meaningful concepts (stripes, texture, fur color). Testing with Concept Activation Vectors (TCAV), developed at Google in 2018, measures how sensitive a model's predictions are to the presence of specific concepts. A user can ask "how important is the concept 'striped texture' to this model's classification of zebras?" and receive a quantitative answer. This aligns explanations more closely with how humans think about visual categories, making them more useful for non-technical stakeholders.
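The TCAV recipe can be sketched on a toy example: find a direction in activation space that corresponds to the concept, then count how often moving along that direction raises the class score. Everything below is an assumption for illustration. The 2-dimensional activations and the logit function are hypothetical, the concept direction is a difference of means (TCAV as published trains a linear classifier and uses the normal to its decision boundary), and the directional derivative is estimated by finite differences rather than backpropagation.

```python
def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def concept_direction(concept_acts, random_acts):
    """Unit vector from the mean random activation to the mean concept
    activation (a stand-in for the classifier-based CAV)."""
    diff = [c - r for c, r in zip(mean_vector(concept_acts), mean_vector(random_acts))]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]

def class_logit(h):
    """Hypothetical head mapping a 2-D hidden activation to the class logit."""
    return h[0] * h[1]

def tcav_score(logit_fn, class_acts, cav, eps=1e-4):
    """Fraction of class examples whose logit rises when the activation is
    nudged along the concept direction."""
    positive = 0
    for h in class_acts:
        nudged = [x + eps * c for x, c in zip(h, cav)]
        sensitivity = (logit_fn(nudged) - logit_fn(h)) / eps
        if sensitivity > 0:
            positive += 1
    return positive / len(class_acts)

# Hypothetical hidden-layer activations.
striped_acts = [[2.1, 0.1], [1.9, -0.1], [2.0, 0.0]]      # inputs showing the concept
random_acts = [[0.1, 0.1], [-0.1, -0.1], [0.0, 0.0]]      # unrelated inputs
zebra_acts = [[1.0, 0.5], [2.0, 1.0], [0.5, -0.3], [1.5, 0.2]]

cav = concept_direction(striped_acts, random_acts)
score = tcav_score(class_logit, zebra_acts, cav)
print("TCAV score:", score)
```

In this toy setup three of the four class examples respond positively to the concept direction, giving a score of 0.75. The published method also compares against directions built from random concepts to check that a score is statistically meaningful, which this sketch omits.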
Mechanistic Interpretability
Mechanistic interpretability aims to reverse-engineer neural networks at the level of individual neurons, circuits, and computational mechanisms. Unlike post-hoc explanation methods that describe model behavior from the outside, mechanistic interpretability seeks to understand the internal representations and computations that produce that behavior. This is analogous to the difference between observing that a car goes faster when you press the accelerator (behavioral) and understanding the engine's combustion cycle, transmission, and drivetrain (mechanistic).
Research at Anthropic, Google DeepMind, and other organizations has identified specific circuits within neural networks that implement identifiable computational functions. In transformer language models, researchers have found induction heads (circuits that identify and continue repeating patterns), indirect object identification circuits (circuits that work out which of the names mentioned in a sentence is the indirect object, so that "When Mary and John went to the store, John gave a drink to" is completed with "Mary"), and factual association circuits (circuits that retrieve stored factual knowledge). These findings demonstrate that neural networks develop organized internal structure during training, not just a soup of undifferentiated parameters.
Sparse autoencoders, a technique that has gained significant traction since 2023, decompose a neural network's internal activations into interpretable features. By training a sparse autoencoder on a layer's activations, researchers can identify features that activate in response to specific semantic concepts, syntactic structures, or factual knowledge. Anthropic's work on Claude found features corresponding to concepts as specific as the Golden Gate Bridge, sycophantic behavior, and code vulnerability detection. These features provide a vocabulary for describing what the model is "thinking about" at each step of processing, moving toward genuine interpretability rather than post-hoc approximation.
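The structure of a sparse autoencoder is simple even though training one at scale is not. The sketch below shows only the forward pass and the usual objective (reconstruction error plus an L1 penalty on feature activations). The weights are hand-set and hypothetical, the activation is 2-dimensional with a 3-feature dictionary, and no training loop is included.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Hypothetical hand-set weights: 2-D activations, 3 dictionary features.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # one row per feature
B_ENC = [0.0, 0.0, -1.5]                        # negative bias keeps feature 2 off
W_DEC = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]      # one column per feature

def encode(activation):
    """Map a layer activation to non-negative feature activations."""
    pre = matvec(W_ENC, activation)
    return relu([p + b for p, b in zip(pre, B_ENC)])

def decode(features):
    """Reconstruct the activation as a weighted sum of feature directions."""
    return matvec(W_DEC, features)

def sae_loss(activation, l1_coeff=0.01):
    """Mean squared reconstruction error plus an L1 sparsity penalty."""
    features = encode(activation)
    reconstruction = decode(features)
    mse = sum((a - r) ** 2 for a, r in zip(activation, reconstruction)) / len(activation)
    return mse + l1_coeff * sum(abs(f) for f in features)

activation = [1.0, 0.2]
features = encode(activation)
reconstruction = decode(features)
print("features:", features, "reconstruction:", reconstruction)
```

The dictionary has more features than the activation has dimensions, and the L1 term pushes most of them to zero for any given input. Interpretation then proceeds by inspecting which inputs make each feature fire.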
The practical significance of mechanistic interpretability extends beyond academic understanding. If you can identify the specific circuits responsible for undesirable behavior, like generating toxic content or perpetuating stereotypes, you can potentially intervene at the circuit level rather than relying on blunt approaches like output filtering. This surgical precision in safety interventions represents a qualitative advance over current approaches, though the field is still early and most findings apply to models much smaller than production-scale systems.
Limitations and Open Challenges
All current explanation methods have significant limitations. Post-hoc explanations are approximations that may not faithfully represent the model's actual reasoning. A SHAP explanation identifies which features were important but not how those features interacted within the model. A LIME explanation is valid only in a local neighborhood and may be misleading if extrapolated. Attention visualizations show correlation between attention and output but do not prove causation. Users who trust these explanations uncritically may develop false confidence in their understanding of the model.
Explanation methods can also be gamed. Researchers have demonstrated that it is possible to build models that produce fair-looking explanations while making biased decisions. A model can learn to appear to rely on legitimate features (income, credit history) while actually using protected characteristics (race, gender) through hidden correlations. If regulators rely on explanation methods to verify fairness, adversarial model designers can exploit the gap between the explanation and the actual decision process. This highlights the importance of combining explanation methods with independent auditing and outcome testing.
The audience for explanations matters enormously. A data scientist, a physician, a judge, a loan applicant, and a regulator all need different types of explanations at different levels of technical detail. An explanation that satisfies a machine learning researcher may be incomprehensible to the person affected by the decision. Conversely, a simplified explanation accessible to a non-expert may omit critical nuances that a technical auditor needs. Effective XAI requires not just technical methods but careful design of the communication layer that translates technical explanations into forms appropriate for each audience.
Explainable AI bridges the gap between model accuracy and human understanding through techniques ranging from inherently interpretable models to post-hoc methods like SHAP and LIME to emerging mechanistic interpretability research. No single technique provides complete, faithful explanations, and effective deployment requires matching the explanation method to both the model type and the audience's needs.