🔍 Interpretability & Analysis of LMs
Outstanding research in LM interpretability and evaluation, summarized
Paper • 2501.08319 • Published • 10
Note: The authors improve LM-generated descriptions for automatic interpretability by accounting for the effect of each feature on the LM output, either via vocabulary projections or direct probability comparison. Results show that such approaches are less expensive than common input-based autointerp methods, and can be integrated with those to improve the quality of feature descriptions.
Open Problems in Machine Unlearning for AI Safety
Paper • 2501.04952 • Published • 1
Towards scientific discovery with dictionary learning: Extracting biological concepts from microscopy foundation models
Paper • 2412.16247 • Published • 1
Inferring Functionality of Attention Heads from their Parameters
Paper • 2412.11965 • Published • 2
Note: Thread: https://bsky.app/profile/megamor2.bsky.social/post/3ldlwiz42j22c
LatentQA: Teaching LLMs to Decode Activations Into Natural Language
Paper • 2412.08686 • Published • 1
Note: Authors train a model to answer open-ended questions about model activations in natural language. The approach can be used for reading properties of LM activations, or steering the generation by directly optimizing the latents to closely match a natural language prompt. Thread: https://x.com/aypan_17/status/1867626789847413119
Training Large Language Models to Reason in a Continuous Latent Space
Paper • 2412.06769 • Published • 75
Note: Chain of Continuous Thought (Coconut) feeds the model's last hidden state back as the input for the subsequent generation step, preserving more information than committing to a single output token. Answers are then generated in natural language. The method is shown to outperform CoT and its variants on mathematical reasoning and related tasks. Thread: https://x.com/Ber18791531/status/1866561188664087017
Incremental Sentence Processing Mechanisms in Autoregressive Transformer Language Models
Paper • 2412.05353 • Published • 1
Note: Thread: https://bsky.app/profile/michaelwhanna.bsky.social/post/3ldnyqfcrzc2f
Sometimes I am a Tree: Data Drives Unstable Hierarchical Generalization
Paper • 2412.04619 • Published • 1
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
Paper • 2411.14257 • Published • 9
Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
Paper • 2411.12580 • Published • 2
Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning
Paper • 2411.10397 • Published • 1
Controllable Context Sensitivity and the Knob Behind It
Paper • 2411.07404 • Published • 1
Note: Thread: https://x.com/jkminder/status/1856671617029398952 Code: https://github.com/kdu4108/context-vs-prior-finetuning
Counterfactual Generation from Language Models
Paper • 2411.07180 • Published • 5
The LLM Language Network: A Neuroscientific Approach for Identifying Causally Task-Relevant Units
Paper • 2411.02280 • Published • 1
Note: Thread: https://bsky.app/profile/bkhmsi.bsky.social/post/3ldo5jnpgm22m
Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics
Paper • 2410.21272 • Published • 1
Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders
Paper • 2410.20526 • Published • 1
The Geometry of Concepts: Sparse Autoencoder Feature Structure
Paper • 2410.19750 • Published • 2
Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
Paper • 2410.15999 • Published • 19
Decomposing The Dark Matter of Sparse Autoencoders
Paper • 2410.14670 • Published • 1
How Do Multilingual Models Remember? Investigating Multilingual Factual Recall Mechanisms
Paper • 2410.14387 • Published • 1
Automatically Interpreting Millions of Features in Large Language Models
Paper • 2410.13928 • Published • 1
Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs
Paper • 2410.11179 • Published • 1
Towards Interpreting Visual Information Processing in Vision-Language Models
Paper • 2410.07149 • Published • 1
Note: Authors apply ablation techniques to popular VLMs, showing that the token positions corresponding to input image elements preserve most of the information from the corresponding locations. Logit lens is also used to show that visual tokens tend naturally towards meaningful word embeddings despite no explicit conditioning. Late layers of the LLM are found to be responsible for extracting visual information from visual tokens.
Geometric Signatures of Compositionality Across a Language Model's Lifetime
Paper • 2410.01444 • Published • 1
Note: Authors show that LMs encode form complexity linearly and meaning complexity nonlinearly. Form complexity correlates near-perfectly with the linear embedding dimension (PCA) of representations from the start of training. Results suggest LMs encode form and meaning differently: form is an inductive bias represented in high-dimensional linear subspaces, while meaning is learned over time in low-dimensional nonlinear manifolds. Thread: https://x.com/sparse_emcheng/status/1842835883726078311
A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
Paper • 2409.14507 • Published • 1
Residual Stream Analysis with Multi-Layer SAEs
Paper • 2409.04185 • Published
ContextCite: Attributing Model Generation to Context
Paper • 2409.00729 • Published • 14
Amuro & Char: Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models
Paper • 2408.06663 • Published • 16
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
Paper • 2408.05147 • Published • 39
Transformer Explainer: Interactive Learning of Text-Generative Models
Paper • 2408.04619 • Published • 156
The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability
Paper • 2408.01416 • Published • 1
Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
Paper • 2408.00113 • Published • 7
Note: Currently SAEs are evaluated via unsupervised metrics like L0. This work proposes using a constrained setting like board games to fix a predefined set of known features (i.e. board states and their properties), and uses SAE performance in recovering those as a quality metric. p-annealing is proposed to transition the sparsity penalty from L1 towards L0 during training, improving SAE quality. Thread: https://x.com/a_karvonen/status/1819399813441663042
InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
Paper • 2407.14494 • Published • 1
Note: This work introduces Strict IIT (SIIT) to intervene in low-level NN nodes while preserving high-level task coherence. Authors train several LMs with SIIT to follow circuits of interest, producing models that can be used as benchmarks for component attribution methods while remaining realistic, as opposed to Tracr-compiled RASP models. Experiments on various methods show that EAP-IG produces results on par with ACDC. Thread: https://x.com/IvanArcus/status/1815412941677850949
LLM Circuit Analyses Are Consistent Across Training and Scale
Paper • 2407.10827 • Published • 4
Note: This paper examines how language model circuits and components develop throughout pre-training, studying models from 70M to 2.8B parameters across 300B training tokens. Task performance and functional components emerge at similar token counts across model scales, with the overall algorithmic structure of circuits remaining roughly stable across training. This suggests that a mechanistic understanding of early checkpoints and smaller models could improve MI research efficiency.
On Evaluating Explanation Utility for Human-AI Decision Making in NLP
Paper • 2407.03545 • Published • 1
Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs
Paper • 2406.20086 • Published • 5
Note: This work hypothesizes the existence of an implicit vocabulary in autoregressive LMs that maps multi-token sequences to word-like embeddings in early layers. A token information erasure phenomenon is discovered for multi-word expressions and used as a signal to retrieve possible implicit vocabulary items for Llama 2 and Llama 3.
Multi-property Steering of Large Language Models with Dynamic Activation Composition
Paper • 2406.17563 • Published • 4
Note: This work studies optimal strategies for multi-property activation steering of LLMs to combine language conditioning with stylistic properties like safety and formality. Steering intensity is shown to be property-dependent, with a trade-off between steering accuracy and output fluency. Dynamic Activation Composition is proposed to adaptively modulate the steering intensity from the expected shift in the LLM predictive distribution, ensuring good steering while minimizing fluency degradation.
What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Noise-free Text-Image Corruption and Evaluation
Paper • 2406.16320 • Published • 2
Confidence Regulation Neurons in Language Models
Paper • 2406.16254 • Published • 10
Note: This work focuses on neuron-level mechanisms for confidence calibration in LLMs, identifying entropy and token frequency neurons. Entropy neurons have high weight norms but minimal direct impact on logits, effectively modulating output distribution entropy. Token frequency neurons boost or suppress each token's logit proportionally to its frequency. Experiments demonstrate the interaction between entropy neurons and other mechanisms, such as induction heads.
Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation
Paper • 2406.13663 • Published • 7
Note: This work proposes to use LLM internals to create RAG citations, ensuring citations reflect actual context usage during generation rather than surface-level similarity. The proposed method MIRAGE (Model Internals-based RAG Explanations) detects context-sensitive answer tokens and pairs them with retrieved documents contributing to their prediction via saliency methods. The method outperforms 11B NLI validators without model/data requirements, and is more robust and faithful than self-citation.
Estimating Knowledge in Large Language Models Without Generating a Single Token
Paper • 2406.12673 • Published • 7
Note: This work aims to test whether model knowledge about an entity can be estimated only from its internal computation. In particular, model internals are used to estimate the model's confidence in QA about the entity, and the factuality of responses. Experiments show that KEEN, a simple probe trained over internal subject representations, succeeds at both tasks. Moreover, KEEN aligns with the model's hedging behavior and faithfully reflects changes in the model's knowledge after fine-tuning.
From Insights to Actions: The Impact of Interpretability and Analysis Research on NLP
Paper • 2406.12618 • Published • 5
Note: Interpretability and analysis (IA) of LMs is often criticized for lacking actionable insights towards advancing NLP. This study analyzes a citation graph of 185K+ papers from ACL/EMNLP, and surveys 138 NLP researchers about their views on IA. IA work is well-cited, with many works building on IA findings. Authors call on IA researchers to center humans in their work, think about the big picture, provide actionable insights and work towards standardized, robust methods.
Talking Heads: Understanding Inter-layer Communication in Transformer Language Models
Paper • 2406.09519 • Published • 1
Note: LLMs struggle to reliably recall items from a list as the length of the list increases, depending on the items' positions in it. This work shows that models' prompt sensitivity is caused by failures in information communication across model layers using "communication channels", i.e. subspaces of the residual stream. A new SVD-based procedure is proposed to identify these channels from weight matrices.
Dual Process Learning: Controlling Use of In-Context vs. In-Weights Strategies with Weight Forgetting
Paper • 2406.00053 • Published • 1
Note: This work focuses on structural ICL, i.e., extrapolation to unseen tokens within familiar structures. The authors find that structural ICL is transient (i.e., it decays over training) in both natural and synthetic settings. Active forgetting can maintain structural ICL in the synthetic setting. Temporary forgetting is introduced, enabling a dual process strategy where models use in-weight solutions for frequent tokens, and structural ICL solutions for rare ones.
Calibrating Reasoning in Language Models with Internal Consistency
Paper • 2405.18711 • Published • 6
Note: CoT reasoning leads to inconsistencies between models' middle and final layers, suggesting some uncertainty. Authors propose to use the internal consistency (IC) of model predictions produced from intermediate layers as a measure of the model's confidence in the reasoning process. IC is found to effectively distinguish between correct and incorrect reasoning paths, and up-weighting reasoning paths with high internal consistency leads to significant improvements in reasoning performance.
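The internal-consistency idea can be sketched roughly as follows, assuming each reasoning path comes with the answer decoded at every layer (for example via a probe or a vocabulary projection). The function names and the toy data are hypothetical, and the paper's actual decoding and weighting scheme may differ.

```python
from collections import defaultdict

def internal_consistency(layer_answers):
    """Fraction of intermediate-layer answers that agree with the final-layer answer."""
    *intermediate, final = layer_answers
    if not intermediate:
        return 1.0
    return sum(a == final for a in intermediate) / len(intermediate)

def weighted_vote(paths):
    """Return the final answer with the largest total internal consistency."""
    scores = defaultdict(float)
    for layer_answers in paths:
        scores[layer_answers[-1]] += internal_consistency(layer_answers)
    return max(scores, key=scores.get)

# Toy data (made up): answers decoded at successive layers for three sampled paths.
paths = [
    ["B", "B", "B", "B"],  # final answer B, fully consistent
    ["B", "B", "A", "A"],  # final answer A, only 1 of 3 earlier layers agrees
    ["B", "A", "B", "A"],  # final answer A, only 1 of 3 earlier layers agrees
]
best = weighted_vote(paths)
```

In this toy case a plain majority vote over final answers would pick A (two paths out of three), while the consistency-weighted vote prefers B.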
From Neurons to Neutrons: A Case Study in Interpretability
Paper • 2405.17425 • Published • 2
Emergence of a High-Dimensional Abstraction Phase in Language Transformers
Paper • 2405.15471 • Published • 2
Note: This work shows that representation processing in Transformer-based LMs undergoes a phase transition characterized by an expansion of intrinsic dimensionality, cross-model information sharing, and a switch to abstract information processing. This dynamic is reduced in the presence of random text, and absent in untrained models. The depth at which this dynamic appears is found to correlate empirically with model quality.
Not All Language Model Features Are Linear
Paper • 2405.14860 • Published • 39
Note: While some features in Transformer LMs are represented linearly in a single dimension, authors find evidence of multidimensional features that can be extracted using techniques such as SAEs. In particular, authors find evidence for circular representations of time-related concepts such as days, months and years in the representations of multiple LMs.
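To make "circular representation" concrete, here is a toy illustration (not extracted from any model) in which days of the week sit on a unit circle, so that "add k days" becomes a rotation in a two-dimensional subspace. All names are hypothetical.

```python
import math

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def encode(day):
    """Place a day on the unit circle (a two-dimensional feature)."""
    angle = 2 * math.pi * DAYS.index(day) / len(DAYS)
    return (math.cos(angle), math.sin(angle))

def rotate(point, steps):
    """Advance by `steps` days by rotating the point around the origin."""
    theta = 2 * math.pi * steps / len(DAYS)
    x, y = point
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

def decode(point):
    """Read out the day whose encoding is nearest to the point."""
    return min(DAYS, key=lambda d: math.dist(point, encode(d)))

three_days_after_saturday = decode(rotate(encode("Sat"), 3))
```

No single direction in this plane separates all seven days, which is the sense in which such a feature is irreducibly multidimensional.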
Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models
Paper • 2405.12522 • Published • 2
Your Transformer is Secretly Linear
Paper • 2405.12250 • Published • 151
Note: This work shows near-linear behavior across various LLM architectures, showing how pre-training tends to increase nonlinear dynamics while fine-tuning can make them more pronounced. Authors speculate that, as embeddings become more uniform across neighboring layers, models may compensate for the reduced variability by amplifying non-linear processing in the residual stream.
The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks
Paper • 2405.10928 • Published • 1
Using Degeneracy in the Loss Landscape for Mechanistic Interpretability
Paper • 2405.10927 • Published • 3
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
Paper • 2405.08366 • Published • 2
Learned feature representations are biased by complexity, learning order, position, and more
Paper • 2405.05847 • Published • 2
Interpretability Needs a New Paradigm
Paper • 2405.05386 • Published • 3
A Primer on the Inner Workings of Transformer-based Language Models
Paper • 2405.00208 • Published • 9
Note: This work summarizes recent trends in interpretability research for Transformer-based LMs, presenting model components, popular methods, and findings using a joint notation. Approaches are categorized based on their usage for either behavior localization or information decoding, and a section is reserved to present popular tools for conducting interpretability research.
What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation
Paper • 2404.07129 • Published • 3
Note: This study introduces "clamping," a method inspired by optogenetics, for training-time causal interventions to study mechanism formation. Authors apply clamping to study the emergence of induction heads (IHs), finding that IH contributions are additive and redundant, with competition emerging due to optimization pressures, and with a many-to-many dependence on previous-token heads in lower layers. Three critical induction subcircuits are identified, and their formation is connected to data-dependent properties.
LM Transparency Tool: Interactive Tool for Analyzing Transformer Language Models
Paper • 2404.07004 • Published • 6
Note: The LM Transparency Tool is an open-source toolkit and visual interface to efficiently identify component circuits in LMs responsible for their predictions using Information Flow Routes. The tool can highlight the importance of granular components, and vocabulary projections are provided to examine intermediate predictions of the residual stream and the tokens promoted by specific component updates.
Does Transformer Interpretability Transfer to RNNs?
Paper • 2404.05971 • Published • 3
Note: This work applies contrastive activation addition, the tuned lens and probing for eliciting latent knowledge in quirky models to Mamba and RWKV LMs, finding these Transformer-specific methods can be applied with slight adaptation to these architectures, obtaining similar results.
Context versus Prior Knowledge in Language Models
Paper • 2404.04633 • Published • 5
Note: This work examines the influence of context versus memorized knowledge in LMs through the lens of the shift caused by contexts at various degrees of informativeness to the models' predictive distribution. Authors propose information-theoretic metrics to measure the persuasiveness of a context and the susceptibility of an entity to be influenced by contextual information. Analysis reveals important differences due to model size, query formulation and context assertiveness/negation.
Locating and Editing Factual Associations in Mamba
Paper • 2404.03646 • Published • 3
Note: This work applies the ROME method to Mamba, finding that weights playing the role of MLPs encode factual relations across several Mamba layers, and that they can be patched to perform model editing. A new SSM-specific technique is also introduced to emulate attention knockout (value zeroing), revealing information flows similar to the ones in Transformers when processing factual statements.
ReFT: Representation Finetuning for Language Models
Paper • 2404.03592 • Published • 92
Note: This work introduces Representation fine-tuning (ReFT), a framework using learned inference-time interventions as efficient yet effective alternatives to PEFT weight adaptation. LoReFT, a ReFT variant intervening linearly on representation subspaces, is evaluated against several PEFT approaches, showing SOTA performance across popular benchmarks while being 10-50x more parameter-efficient. The HF-compatible pyreft library is introduced to simplify ReFT usage.
Do language models plan ahead for future tokens?
Paper • 2404.00859 • Published • 2
Note: This work aims to evaluate whether language models exhibit implicit planning during generation. In a synthetic setting and employing a myopic variant of gradient descent ignoring off-diagonal information, authors find that LMs can implicitly plan for future predictions. However, the same behavior is observed to a much lesser extent for natural language, where computation for current predictions is also functional to future results.
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Paper • 2403.19647 • Published • 3
Note: This work proposes using features and errors from sparse autoencoders trained to reconstruct LM activations as interpretable units for circuit discovery. The authors then introduce SHIFT, a technique for editing model behavior by ablating interpretable elements from sparse feature circuits. This method is applied alongside unsupervised circuit discovery at scale by means of clustering, showing highly interpretable feature circuits interacting in behaviors like predicting sequence increments.
Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms
Paper • 2403.17806 • Published • 3
Note: This work introduces an efficient approach for circuit discovery, EAP-IG, using integrated gradients to perform edge attribution patching. Circuits found by EAP-IG and other methods are evaluated in terms of faithfulness, i.e. consistency of pre- and post-patching behavior, finding EAP-IG outperforms EAP across all tested tasks. The overlap between circuits found by activation patching and EAP is a faithfulness indicator only when full or no overlap is present, but not for partial overlap cases.
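The gap between a single-gradient patching estimate and its integrated-gradients counterpart can be seen on a toy one-dimensional "model". This is only a sketch under stated assumptions: a scalar activation, a hand-picked saturating metric (`tanh`), and hypothetical function names. It is not the paper's implementation.

```python
import math

def metric(a):
    """Toy model output as a function of one activation (saturates for large a)."""
    return math.tanh(a)

def grad(a):
    """Analytic derivative of the toy metric."""
    return 1.0 - math.tanh(a) ** 2

def gradient_patching(a_clean, a_corrupt):
    """First-order estimate of the patching effect, using the gradient at the clean input."""
    return (a_corrupt - a_clean) * grad(a_clean)

def ig_patching(a_clean, a_corrupt, steps=50):
    """Same estimate with the gradient averaged along the clean-to-corrupt path (midpoint rule)."""
    delta = a_corrupt - a_clean
    avg_grad = sum(grad(a_clean + (k + 0.5) / steps * delta) for k in range(steps)) / steps
    return delta * avg_grad

a_clean, a_corrupt = 5.0, 0.0
true_effect = metric(a_corrupt) - metric(a_clean)      # what actual patching would measure
single_grad = gradient_patching(a_clean, a_corrupt)    # near zero: gradient vanishes at a = 5
integrated = ig_patching(a_clean, a_corrupt)           # close to true_effect
```

Because the gradient is almost zero at the saturated clean activation, the single-gradient estimate misses an effect that is actually large, while averaging gradients along the path recovers it.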
Information Flow Routes: Automatically Interpreting Language Models at Scale
Paper • 2403.00824 • Published • 3
Note: This work proposes an efficient approach for circuit discovery. Information flow routes require a single forward pass, and are derived from decomposing component updates into the Transformer residual stream. Experiments on LLaMA 7B show how the contrastive formulation of activation patching (which can be avoided with information flow routes) can lead to misleading results depending on the selected templates.
AtP*: An efficient and scalable method for localizing LLM behaviour to components
Paper • 2403.00745 • Published • 13
Note: Authors identify two failure modes for the attribution patching (AtP) method for estimating component importance in LMs, leading to false negatives due to attention saturation or cancellation of direct and indirect effects. An improved version named AtP* is proposed to improve the method’s robustness in such settings. A diagnostic procedure is also proposed to bound the error caused by gradient approximation.
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
Paper • 2402.17700 • Published • 2
Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking
Paper • 2402.14811 • Published • 4
Note: This work investigates the effect of fine-tuning on circuit-level mechanisms in LLMs, focusing on the entity tracking task on LLaMA 7B variants. Authors find that circuits from the base model persist in fine-tuned models, and their individual components preserve their functionalities. Cross-Model Activation Patching (CMAP) reveals that gains in performance can be attributed to improvements in circuit components, rather than overall functional rearrangement.
Enhanced Hallucination Detection in Neural Machine Translation through Simple Detector Aggregation
Paper • 2402.13331 • Published • 2
Note: This work proposes Simple Detectors Aggregation (STARE), an aggregation procedure to leverage hallucination detectors’ complementary strengths in the context of machine translation. Authors experiment with two popular hallucination detection benchmarks (LFAN-HALL and HalOmi), showing that an aggregation of detectors using only model internals can outperform ad-hoc trained hallucination detectors.
Backward Lens: Projecting Language Model Gradients into the Vocabulary Space
Paper • 2402.12865 • Published • 1
Note: This work extends Logit Lens vocabulary projections of FFNs in Transformers to gradients to study the knowledge editing performed by backward passes. Authors prove that a gradient matrix can be cast as a low-rank linear combination of its forward and backward passes’ inputs, and identify an imprint-and-shift mechanism driving knowledge updating in FFNs. Finally, an efficient editing method driven by the linearization above is evaluated, showing strong performances in simple editing settings.
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
Paper • 2402.12560 • Published • 3
Note: The authors introduce a revisited psycholinguistic benchmark to evaluate the effectiveness and reliability of intervention-based mechanistic interpretability methods across several linguistic tasks. Across several model sizes, Distributed alignment search (DAS) and probing are found to be the most reliable approaches, and are used to investigate the emergence of features linked to linguistically plausible predictions in the initial phases of model training.
In-Context Learning Demonstration Selection via Influence Analysis
Paper • 2402.11750 • Published • 2
Note: This work introduces InfICL, a demonstration selection method using influence functions to identify salient training examples to use as demonstrations at inference time. InfICL is tested alongside other example selection baselines for prompting medium-sized LLMs for CoLA and RTE, showing improvements over other methods, especially when a smaller number of in-context examples is used.
Recovering the Pre-Fine-Tuning Weights of Generative Models
Paper • 2402.10208 • Published • 7
Note: This paper introduces SpectralDeTuning, a method to recover the original pre-trained weights of a model from a set of LoRA fine-tunes with merged weights. Authors introduce the LoWRA Bench dataset to measure progress in this task, and show that the method performs well for both language and vision models. The current limitations of the approach are 1) assuming the attacker knows the rank used in the LoRAs and 2) the need for a good number of LoRAs to reconstruct the original pre-trained weights effectively.
SyntaxShap: Syntax-aware Explainability Method for Text Generation
Paper • 2402.09259 • Published • 2
Note: Authors propose SyntaxSHAP, a variant of the model-agnostic SHAP approach enforcing tree-based coalitions based on the syntax of the explained sequence, while preserving most properties of SHAP explanations. The approach is found to be more faithful and semantically meaningful than other model-agnostic methods when explaining the predictions of LMs such as GPT-2 and Mistral, especially in edge cases such as negation.
Show Me How It's Done: The Role of Explanations in Fine-Tuning Language Models
Paper • 2402.07543 • Published • 2
Note: Authors propose a fine-tuning procedure with natural language explanations to clarify intermediate reasoning steps. Several LMs are fine-tuned on the ListOps dataset, containing synthetically-generated instructions on sequences of numbers. Authors find that explanations improve model performance across all tested model sizes and explanation lengths. Smaller language models benefit the most from explanations, especially when long-form.
Model Editing with Canonical Examples
Paper • 2402.06155 • Published • 12
Note: This work introduces a model editing approach using individual “canonical” examples to showcase desired/unwanted behavior. The approach is tested on regular LMs and Backpack LMs, which are more controllable thanks to disentangled sense vector representations. For the latter, authors propose sense fine-tuning, i.e. updating a few sense vectors with canonical examples to apply desired changes in an efficient and effective way, outperforming other model editing approaches and full/LoRA fine-tuning.
AttnLRP: Attention-Aware Layer-wise Relevance Propagation for Transformers
Paper • 2402.05602 • Published • 4
Note: This work proposes extending the LRP feature attribution framework to handle Transformer-specific layers. Authors show that AttnLRP is significantly more faithful than other popular attribution methods, has minimal time requirements for execution and can be employed to identify model components associated with specific concepts in generated text.
Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models
Paper • 2402.04614 • Published • 3
Note: This work discusses the dichotomy between faithfulness and plausibility in LLMs’ self-explanations (SEs) employing natural language (CoT, counterfactual reasoning, and token importance), which tend to be plausible but unfaithful to models' reasoning process. Authors call for a community effort to 1) develop reliable metrics to characterize the faithfulness of explanations and 2) pioneer novel strategies to generate more faithful SEs.
INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection
Paper • 2402.03744 • Published • 4
Note: While most internals-based hallucination detection methods use surface-level information, this work proposes EigenScore, an internal measure of responses’ self-consistency using the eigenvalues of sampled responses' covariance matrix in intermediate model layers to quantify answers’ diversity in the dense embedding space. EigenScore outperforms logit-level methods for hallucination detection on QA tasks, especially with feature clipping to control overconfident generations.
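A rough sketch of a score in this spirit, assuming it takes the form of the log-determinant of a regularized covariance (Gram) matrix over K sampled response embeddings, divided by K. The exact centering, the regularizer `alpha`, and all function names here are assumptions rather than the paper's code, and the embeddings are toy vectors.

```python
import math

def log_det_spd(m):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky factorization."""
    n = len(m)
    chol = [[0.0] * n for _ in range(n)]
    log_det = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                chol[i][j] = math.sqrt(s)
                log_det += 2.0 * math.log(chol[i][j])
            else:
                chol[i][j] = s / chol[j][j]
    return log_det

def eigenscore(embeddings, alpha=1e-3):
    """embeddings: K vectors (lists of d floats), one per sampled response."""
    k = len(embeddings)
    # Assumption: each embedding is centered over its own d dimensions.
    centered = [[x - sum(z) / len(z) for x in z] for z in embeddings]
    cov = [[sum(a * b for a, b in zip(zi, zj)) + (alpha if i == j else 0.0)
            for j, zj in enumerate(centered)]
           for i, zi in enumerate(centered)]
    return log_det_spd(cov) / k

consistent = [[1.0, 0.0, 0.0]] * 3                               # identical responses
diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]    # spread-out responses
```

Under this reading, the diverse set receives a higher score than the consistent one, so a high score flags answers whose sampled embeddings disagree.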
Rethinking Interpretability in the Era of Large Language Models
Paper • 2402.01761 • Published • 23
Note: In this opinion piece, authors contend that the new capabilities of LLMs can transform the scope of interpretability, moving from low-level explanations such as saliency maps and circuit analysis to natural language explanations. This goal is hindered by LMs’ natural tendency to hallucinate, their large size and their inherent opaqueness. Authors highlight in particular dataset explanations for knowledge discovery, explanations’ reliability and interactive explanations as key areas moving ahead.
ReAGent: Towards A Model-agnostic Feature Attribution Method for Generative Language Models
Paper • 2402.00794 • Published • 1
Note: Authors propose Recursive Attribution Generation (ReAGent), a perturbation-based feature attribution approach specifically conceived for generative LMs. The method employs a lightweight encoder LM to replace sampled input spans with valid alternatives and measures the effect of the perturbation on the drop in next-token probability predictions. ReAGent is shown to consistently outperform other established approaches across several models and generation tasks in terms of faithfulness.
A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains
Paper • 2402.00559 • Published • 3
Note: This work introduces a new methodology for human verification of reasoning chains and adopts it to annotate a dataset of chain-of-thought reasoning chains produced by 3 LMs. The annotated dataset, REVEAL, can be used to benchmark automatic verifiers of reasoning in LMs. In their analysis, the authors find that LM-produced CoTs generally contain faulty steps often leading to wrong automatic verification.
Gradient-Based Language Model Red Teaming
Paper • 2401.16656 • Published • 1
Note: This work proposes Gradient-Based Red Teaming (GBRT), a red teaming method for automatically generating diverse prompts inducing an LM to output unsafe responses. In practice, prompts are learned by scoring LM responses with a safety-trained probing classifier, and back-propagating through frozen classifier and LM to update the prompt. GBRT prompts are shown to be more likely to generate unsafe responses and evade safety-tuning measures than those produced by RL-based methods.
Black-Box Access is Insufficient for Rigorous AI Audits
Paper • 2401.14446 • Published • 3
Note: Audits conducted on AI systems can identify potential risks and ensure their compliance with safety requirements. Authors categorise audits based on the access to model-related resources and highlight how higher levels of transparency into the audited AI system enable broader and more effective auditing procedures. Technical, physical, and legal safeguards for performing audits are also introduced to ensure minimal security risks for audited companies.
The Calibration Gap between Model and Human Confidence in Large Language Models
Paper • 2401.13835 • Published • 4
Note: This work evaluates the human confidence in LLM responses to multiple-choice MMLU questions based on explanations the LLM provides together with selected answers. The authors experiment with altering the model prompt to reflect the actual prediction confidence in models’ explanations, showing improved calibration for users’ assessment of LLM’s reliability and a better ability to discriminate between correct and incorrect answers.
In-Context Language Learning: Architectures and Algorithms
Paper • 2401.12973 • Published • 4Note This work methodically evaluates in-context learning on formal languages across several model architectures, showing that Transformers work best in this setting. These results are attributed to the presence of “n-gram heads” able to retrieve the token following a context already seen in the current context window and copy it. These insights are used to design static attention layers mimicking the behavior of n-gram heads, leading to lower perplexity despite the lower computational cost.
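The retrieve-and-copy behavior attributed to n-gram heads can be sketched in plain Python. This is only an illustration of the lookup described in the note, not the paper's architecture: the function name `ngram_head_predict` and the choice to match the most recent earlier occurrence are assumptions.

```python
from typing import Optional, Sequence


def ngram_head_predict(tokens: Sequence[str], n: int = 2) -> Optional[str]:
    """Return the token that followed the most recent earlier occurrence
    of the last (n - 1) tokens, or None if that context was never seen."""
    k = n - 1  # length of the context to match
    if k < 1 or len(tokens) <= k:
        return None
    context = tuple(tokens[-k:])
    # Scan backwards over earlier positions, skipping the final occurrence itself.
    for start in range(len(tokens) - k - 1, -1, -1):
        if tuple(tokens[start:start + k]) == context:
            return tokens[start + k]  # copy the token that came next
    return None


# The context ("a", "b") was seen before and was followed by "c".
prediction = ngram_head_predict(["a", "b", "c", "a", "b"], n=3)
```

A real attention head would implement this softly, through learned attention weights over previous positions, rather than by exact matching.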
From Understanding to Utilization: A Survey on Explainability for Large Language Models
Paper • 2401.12874 • Published • 4Note This survey summarizes recent works in interpretability research, focusing mainly on pre-trained Transformer-based LMs. The authors categorize current approaches as either local or global and discuss popular applications of LM interpretability, such as model editing, enhancing model performance, and controlling LM generation.
LLMCheckup: Conversational Examination of Large Language Models via Interpretability Tools
Paper • 2401.12576 • Published • 2Note Authors introduce LLMCheckup, a conversational interface connecting an LLM to several interpretability tools (feature attribution methods, similarity, counterfactual/rationale generation) allowing users to inquire about LLM predictions using natural language. The interface consolidates several interpretability methods in a unified chat interface, simplifying future investigations into natural language explanations.
Universal Neurons in GPT2 Language Models
Paper • 2401.12181 • Published • 5Note This work investigates the universality of individual neurons across GPT2 models trained from different initial random seeds, starting from the assumption that such neurons are likely to exhibit interpretable patterns. 1-5% of neurons consistently activate for the same inputs, and can be grouped into families exhibiting similar functional roles, e.g. modulating prediction entropy, deactivating attention heads, and promoting/suppressing elements of the vocabulary in the prediction.
Can Large Language Models Explain Themselves?
Paper • 2401.07927 • Published • 4Note This study uses self-consistency checks to measure the faithfulness of LLM explanations: if an LLM says a set of words is important for making a prediction, then it should not be able to make the same prediction without these words. Results demonstrate that the faithfulness of LLM self-explanations cannot be reliably trusted, as it proves to be very task and model dependent, with bigger models generally producing more faithful explanations.
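The self-consistency check described above can be illustrated with a toy example. This is a minimal sketch under stated assumptions: `toy_predict` is a hypothetical keyword classifier standing in for an LLM, and the flagged words are simply deleted, which may differ from how the study redacts them.

```python
from typing import Callable, Iterable, List


def is_explanation_consistent(
    predict: Callable[[List[str]], str],
    words: List[str],
    important: Iterable[str],
) -> bool:
    """True if removing the words claimed as important changes the prediction."""
    flagged = set(important)
    original = predict(words)
    redacted = [w for w in words if w not in flagged]
    return predict(redacted) != original


# Hypothetical stand-in for an LLM classifier: positive iff "great" appears.
def toy_predict(words: List[str]) -> str:
    return "positive" if "great" in words else "negative"


review = "the movie was great".split()
# Claiming "great" is important passes the check: without it the label flips.
faithful = is_explanation_consistent(toy_predict, review, ["great"])
# Claiming "movie" is important fails: the label is unchanged without it.
unfaithful = is_explanation_consistent(toy_predict, review, ["movie"])
```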
Fine-grained Hallucination Detection and Editing for Language Models
Paper • 2401.06855 • Published • 4Note Authors introduce a new taxonomy for fine-grained annotation of hallucinations in LM generations and propose Factuality Verification with Augmented Knowledge (FAVA), a retrieval-augmented LM fine-tuned to detect and edit hallucinations in LM outputs, outperforming ChatGPT and LLama2 Chat on both detection and editing tasks.
Patchscope: A Unifying Framework for Inspecting Hidden Representations of Language Models
Paper • 2401.06102 • Published • 21Note Patchscopes is a generalized framework for verbalizing information contained in LM representations. This is achieved via a mid-forward patching operation inserting the information into an ad-hoc prompt aimed at eliciting model knowledge. Patchscope instances for vocabulary projection, feature extraction and entity resolution in model representations are shown to outperform popular interpretability approaches, often resulting in more robust and expressive information.
Model Editing Can Hurt General Abilities of Large Language Models
Paper • 2401.04700 • Published • 3Note This work raises concerns that gains in factual knowledge after model editing can result in a significant degradation of the general abilities of LLMs. Authors evaluate 4 popular editing methods on 2 LLMs across eight representative tasks, showing model editing does substantially hurt model general abilities.
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
Paper • 2309.16042 • Published • 3Note This work systematically examines the impact of methodological details in activation patching, a popular technique with causal guarantees to quantify the importance of model components in driving model predictions. Authors provide several recommendations concerning the type of patching (noise vs. counterfactual), the metric to use (probability vs. logit vs. KL), the number of layers to patch and which tokens to corrupt.
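To make the procedure concrete, here is a toy sketch of counterfactual activation patching with a logit-difference metric. Everything in it is an assumption for illustration: the two-layer network, its weights `W1` and `W2`, and the inputs are made up, and the sketch patches single hidden units rather than components of a real LM.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy network: 2 inputs -> 3 hidden units (ReLU) -> 2 logits.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W2 = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]


def matvec(m: List[List[float]], v: List[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def forward(
    x: List[float], patch: Optional[Dict[int, float]] = None
) -> Tuple[List[float], List[float]]:
    """Run the network, optionally overwriting selected hidden activations."""
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    for idx, value in (patch or {}).items():
        hidden[idx] = value  # the patching step
    return hidden, matvec(W2, hidden)


def logit_diff(logits: List[float]) -> float:
    return logits[0] - logits[1]


clean_x, corrupted_x = [1.0, 0.0], [0.0, 1.0]
clean_hidden, clean_logits = forward(clean_x)
_, corrupted_logits = forward(corrupted_x)
clean_ld, corrupted_ld = logit_diff(clean_logits), logit_diff(corrupted_logits)

# Patch each clean hidden activation into the corrupted run, one unit at a
# time, and report the fraction of the clean logit difference it restores.
recovered = []
for i, value in enumerate(clean_hidden):
    _, patched_logits = forward(corrupted_x, patch={i: value})
    recovered.append(
        (logit_diff(patched_logits) - corrupted_ld) / (clean_ld - corrupted_ld)
    )
```

In this toy setup the third hidden unit has the same activation in both runs and no weight into the logits, so patching it restores nothing, while the first two units account for the full difference between them.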