Submitted: 20 January 2026
Posted: 21 January 2026
Contents

1. Introduction
2. Core Interpretable Objects of LLMs
   2.1. Token Embedding
   2.2. Transformer Block and Residual Stream
   2.3. Multi-Head Attention (MHA)
   2.4. Feed-Forward Network (FFN)
   2.5. Sparse Autoencoder (SAE) Feature
3. Localizing Methods
   3.1. Magnitude Analysis
   3.2. Causal Attribution
   3.3. Gradient Detection
   3.4. Probing
   3.5. Vocabulary Projection
   3.6. Circuit Discovery
4. Steering Methods
   4.1. Amplitude Manipulation
   4.2. Targeted Optimization
   4.3. Vector Arithmetic
5. Applications
   5.1. Improve Alignment
      5.1.1. Safety and Reliability
      5.1.2. Fairness and Bias
      5.1.3. Persona and Role
   5.2. Improve Capability
      5.2.1. Multilingualism
      5.2.2. Knowledge Management
      5.2.3. Logic and Reasoning
   5.3. Improve Efficiency
      5.3.1. Efficient Training
      5.3.2. Efficient Inference
6. Challenges and Future Directions
7. Conclusions
A. Summary of Surveyed Papers
B. References
Paper Outline


1. Introduction
- 1) A Rigorous Pipeline-Driven Framework: We establish a structured framework for applying MI to real-world model optimization. We begin by defining the core Interpretable Objects within LLMs (e.g., neurons, attention heads, residual streams). Based on the application workflow, we clearly categorize methodologies into two distinct stages: Localizing (Diagnosis), which identifies the causal components responsible for specific behaviors, and Steering (Intervention), which actively manipulates these components to alter model outputs. Crucially, for each technique, we provide a detailed Methodological Formulation along with its Applicable Objects and Scope, helping readers quickly understand the technical implementation and appropriate use cases.
- 2) Comprehensive Paradigms for Application: We provide an extensive survey of MI applications organized around three major themes: Improve Alignment, Improve Capability, and Improve Efficiency. These themes cover eight specific scenarios, ranging from safety and multilingualism to efficient training. Instead of merely listing relevant papers, we summarize representative MI application paradigms for each scenario. This approach allows readers to quickly capture the distinct usage patterns of MI techniques across different application contexts, facilitating the transfer of methods to new problems.
- 3) Insights, Resources, and Future Directions: We critically discuss the current challenges in actionable MI research and outline promising future directions. To facilitate further progress and lower the barrier to entry, we curate a comprehensive collection of over 200 papers, which are listed in Table 2. These papers are systematically tagged according to their corresponding localizing and steering methods, providing a practical and navigable reference for the community.
2. Core Interpretable Objects of LLMs
2.1. Token Embedding
2.2. Transformer Block and Residual Stream
2.3. Multi-Head Attention (MHA)
Standard Formulation
Mechanistic View: QK and OV Units
2.4. Feed-Forward Network (FFN)
Standard Formulation
Mechanistic View: Neurons
2.5. Sparse Autoencoder (SAE) Feature
Mathematical Formulation
Training Challenges and Resources
3. Localizing Methods
3.1. Magnitude Analysis
Methodological Formulation
Applicable Objects
- Specialized Neurons and SAE Features: By feeding domain-specific datasets into the model and monitoring activations (e.g., neuron activation states or SAE feature activation states), researchers can isolate components dedicated to specific concepts. For instance, in the context of higher-level reasoning, Galichin et al. [3] utilized SAEs to disentangle the residual stream state. As shown in Figure 4 (a), they proposed a metric called ReasonScore, which aggregates the activation frequency and magnitude of SAE features specifically during “reasoning moments” (e.g., when the model encounters tokens like “Wait” or “Therefore”). By ranking features based on this score, they successfully localized Reasoning-Relevant SAE features that encode abstract concepts like uncertainty or exploratory thinking. Similarly, for style transfer, Lai et al. [24] employed Magnitude Analysis to identify Style-Specific Neurons. As illustrated in Figure 4 (b), they calculated the average activation magnitude of FFN neurons across corpora with distinct styles (e.g., positive vs. negative). Neurons that exhibited significantly higher average activation for the source style compared to the target style were identified as “Source-Style Neurons,” serving as candidates for subsequent deactivation (a minimal sketch of this screening recipe follows this list).
- Attention Heads: The magnitude and distribution of attention scores serve as a direct indicator of a head’s functional role [10,31,32,33,34,35,36]. For instance, Zhou et al. [35] introduced the Safety Head ImPortant Score (Ships), which aggregates attention weights on refusal-related tokens to localize “Safety Heads” critical for model alignment. In the multimodal domain, Sergeev and Kotelnikov [36] and Bi et al. [10] measured the concentration of attention mass on image tokens versus text tokens, successfully pinpointing heads responsible for visual perception and cross-modal processing. Similarly, Singh et al. [33] measured “induction strength”—derived from the attention probability assigned to token repetition patterns—to track the formation and importance of Induction Heads.
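To ground this, here is a minimal sketch of the style-neuron screening recipe: average post-nonlinearity FFN activation magnitudes over two small corpora and rank neurons by the gap. The model (gpt2), layer index, prompts, and top-k below are illustrative placeholders, not choices taken from the cited works.

```python
# Magnitude Analysis sketch: rank FFN neurons by mean activation gap between
# a source-style and a target-style corpus. All choices are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def mean_ffn_activation(prompts, layer):
    """Mean |activation| per FFN neuron (post-nonlinearity) over a corpus."""
    acts = []
    def hook(module, inp, out):
        acts.append(out.detach().abs().mean(dim=(0, 1)))  # -> (d_ffn,)
    handle = model.transformer.h[layer].mlp.act.register_forward_hook(hook)
    with torch.no_grad():
        for p in prompts:
            model(**tok(p, return_tensors="pt"))
    handle.remove()
    return torch.stack(acts).mean(dim=0)

source_style = ["What a wonderful, delightful day!", "I absolutely love this."]
target_style = ["This is terrible and disappointing.", "I truly hate waiting."]

layer = 6
gap = mean_ffn_activation(source_style, layer) - mean_ffn_activation(target_style, layer)
# Neurons with the largest positive gap are candidate source-style neurons,
# to be verified (e.g., via Causal Attribution) before any deactivation.
print(torch.topk(gap, k=10).indices)
```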
Characteristics and Scope
- Advantages: It does not require training auxiliary classifiers or performing computationally expensive backward passes. This makes it highly scalable and suitable for analyzing large models in real-time.
- Limitations: It serves primarily as a lightweight heuristic. High activation magnitude implies high presence but does not guarantee causal necessity (e.g., a high-magnitude feature might be cancelled out by a subsequent layer). Furthermore, its success relies heavily on the quality of the input data; if the dataset fails to elicit the specific behavior, the relevant components will remain dormant. Therefore, Magnitude Analysis is typically used as a “first-pass” screening tool to filter candidate objects for more rigorous verification methods.
3.2. Causal Attribution
Methodological Formulation
Applicable Objects
- Corrupted Run (Intervention): First, the specific knowledge is erased from the model’s computation. A corrupted input is created by adding Gaussian noise to the embeddings of the subject tokens (e.g., “Space Needle”), causing the probability of the correct prediction (“Seattle”) to drop significantly.
- Patched Run (Restoration): The core operation systematically restores specific internal states. For a specific layer l and token position i, the method copies the hidden activation from a clean run on the original input and pastes (restores) it into the corrupted computation graph.
- Effect Measurement: The causal effect is quantified by the Indirect Effect (IE), which measures how much of the original target probability is recovered by this restoration. A high IE score implies that the patched state at layer l and position i carries critical information. A minimal sketch of the three-step procedure appears after this list.
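The sketch below instantiates the three steps on a GPT-2-style model. The subject-token positions, noise scale, and probed (layer, position) are assumptions for illustration; a full causal-tracing implementation sweeps every layer and position and averages over noise samples.

```python
# Causal Attribution sketch (ROME-style causal tracing). Illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tok("The Space Needle is in downtown", return_tensors="pt")
target_id = tok(" Seattle", add_special_tokens=False)["input_ids"][0]
subj_pos = [1, 2]  # assumed positions of the subject tokens "Space Needle"

# 1) Clean run: cache the residual stream at every layer.
with torch.no_grad():
    clean = model(**inputs, output_hidden_states=True)
p_clean = clean.logits[0, -1].softmax(-1)[target_id].item()

noise = 3.0 * torch.randn(len(subj_pos), model.config.n_embd)  # fixed noise

def corrupt(module, inp, out):
    out = out.clone()
    out[0, subj_pos] += noise  # 2) corrupt the subject-token embeddings
    return out

def restore(layer, pos):
    def hook(module, inp, out):
        hidden = out[0].clone()
        hidden[0, pos] = clean.hidden_states[layer + 1][0, pos]  # 3) paste back
        return (hidden,) + out[1:]
    return hook

def target_prob(handles):
    with torch.no_grad():
        p = model(**inputs).logits[0, -1].softmax(-1)[target_id].item()
    for h in handles:
        h.remove()
    return p

layer, pos = 5, subj_pos[-1]
p_corrupt = target_prob([model.transformer.wte.register_forward_hook(corrupt)])
p_patched = target_prob([
    model.transformer.wte.register_forward_hook(corrupt),
    model.transformer.h[layer].register_forward_hook(restore(layer, pos)),
])
print(f"clean={p_clean:.3f}  corrupt={p_corrupt:.3f}  IE={p_patched - p_corrupt:.3f}")
```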
Characteristics and Scope
- Advantages: Unlike Magnitude Analysis (§3.1), which only establishes correlation, Causal Attribution provides definitive evidence that a component is a functional driver of the model’s output. This allows researchers to distinguish essential mechanisms from features that are highly activated but causally irrelevant to the specific behavior.
- Limitations: This rigor incurs a significant computational overhead. Verifying causality typically requires intervening on objects individually and performing a separate forward pass for each intervention. Consequently, the cost scales linearly with the number of objects analyzed, making it prohibitively expensive for dense, sweeping searches over large models. This inefficiency often necessitates the use of Gradient Detection (§3.3), which utilizes gradients to rapidly approximate these causal effects, enabling efficient screening before performing expensive, fine-grained interventions.
3.3. Gradient Detection
Methodological Formulation
Applicable Objects
- Neurons: A standard neuron-level object is the FFN activation vector at layer l. Gradients can be converted into per-neuron scores to rank neurons by their local influence on the objective F. This has been used to localize knowledge- or context-sensitive neurons and analyze their dependencies [64,65,66,67,68,69]. Figure 6 illustrates a concrete LLM-specific instance: Shi et al. [65] compute Integrated Gradients scores to identify neurons most responsible for processing contextual cues under knowledge conflicts (via a context-aware attribution and a high-score intersection criterion), and then reweight the identified neurons to promote context-consistent generation.
- Attention Head Outputs: Gradient Detection also applies to attention-related activations such as the attention head output. Computing the gradient of F with respect to this output and scalarizing it (e.g., via an inner product with the activation itself) yields head-level rankings that can highlight salient heads or attention submodules for further analysis and subsequent intervention [70,71,72]. A gradient-times-activation sketch follows this list.
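As a concrete, simplified instance, the sketch below ranks FFN neurons by the gradient-times-activation score, a cheap first-order cousin of the Integrated Gradients attribution used by Shi et al. [65]. The model, layer, and prompt are illustrative assumptions.

```python
# Gradient Detection sketch: rank FFN neurons by |activation x gradient| of
# the target log-probability F. Illustrative placeholders throughout.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tok("The capital of France is", return_tensors="pt")
target_id = tok(" Paris", add_special_tokens=False)["input_ids"][0]

layer, cache = 6, {}
def hook(module, inp, out):
    out.retain_grad()        # keep the gradient of this intermediate tensor
    cache["act"] = out

handle = model.transformer.h[layer].mlp.act.register_forward_hook(hook)
logits = model(**inputs).logits
F = torch.log_softmax(logits[0, -1], dim=-1)[target_id]  # objective F
F.backward()
handle.remove()

act = cache["act"]
scores = (act * act.grad).abs().sum(dim=(0, 1))  # per-neuron saliency (d_ffn,)
print(torch.topk(scores, k=10).indices)          # candidates for verification
```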
Characteristics and Scope
- Advantages: Gradient Detection is applicable to a broad class of objects without requiring additional training. Compared with exhaustive interventions, it can produce rankings with a relatively small number of backward passes, making it practical as an initial localization step when the candidate set is large.
- Limitations: Gradients provide a local proxy, not causal necessity: salience can be offset by downstream computation, and finite interventions may depart from first-order effects in non-linear regimes. For these reasons, gradient-ranked objects are typically paired with Causal Attribution (§3.2) to validate whether the identified objects are genuinely responsible for the target behavior.
3.4. Probing
Methodological Formulation
Applicable Objects
Characteristics and Scope
- Advantages: With a fixed probe family, Probing enables standardized comparisons across objects, supporting efficient layer-wise tracking and large-scale ranking of candidate modules. Simple probes (e.g., linear) are lightweight and interpretable, allowing broad sweeps while keeping the LLM frozen.
- Limitations: Decodability is not causality: high probe accuracy does not imply the model uses y, nor that the probed object is necessary or sufficient. Results are sensitive to dataset and design choices (e.g., labeling, token positions), so controls and follow-up causal tests are typically required for functional claims.
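A minimal linear-probe sketch under these caveats: extract frozen residual-stream states at one layer and fit a logistic-regression probe on a toy label. The texts, labels, and layer are illustrative placeholders; real probing uses held-out splits and control tasks.

```python
# Probing sketch: a linear probe on frozen residual-stream states.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

texts = ["I loved this film.", "An awful, boring mess.",
         "Absolutely delightful!", "Painfully bad acting."]
labels = [1, 0, 1, 0]  # toy sentiment labels y

def last_token_state(text, layer):
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"), output_hidden_states=True)
    return out.hidden_states[layer][0, -1]  # residual stream, final position

layer = 8
X = torch.stack([last_token_state(t, layer) for t in texts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print("train accuracy:", probe.score(X, labels))
# High accuracy only shows y is decodable here, not that the model uses it.
```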
3.5. Vocabulary Projection
Methodological Formulation
Applicable Objects
Characteristics and Scope
- Advantages: It provides a zero-shot interpretation method that is computationally efficient and intuitive. Unlike Probing (§3.4), it does not require collecting a labeled dataset or training a separate classifier, allowing for immediate inspection of any model state.
- Limitations: The primary limitation is the assumption that intermediate states exist in the same vector space as the output vocabulary (basis alignment). While this often holds for the residual stream due to the residual connection structure, it may be less accurate for components inside sub-layers (like FFN and MHA) or in models where the representation space rotates significantly across layers. Consequently, results should be interpreted as an approximation of the information that is linearly decodable by the final layer.
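The sketch below shows the standard “logit lens” form of Vocabulary Projection on a GPT-2-style model: each intermediate residual-stream state is passed through the final LayerNorm and the unembedding matrix to reveal what it linearly encodes about the next token. The model and prompt are illustrative.

```python
# Vocabulary Projection sketch (logit lens): decode each layer's residual
# stream through the final LayerNorm and the unembedding matrix.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

W_U = model.lm_head.weight     # unembedding matrix, shape (vocab, d_model)
ln_f = model.transformer.ln_f  # final LayerNorm (the basis-alignment caveat)

for layer, h in enumerate(out.hidden_states):
    logits = ln_f(h[0, -1]) @ W_U.T          # project the last position
    print(f"layer {layer:2d}: {tok.decode(logits.argmax().item())!r}")
```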
3.6. Circuit Discovery
Methodological Formulation
Applicable Objects
Characteristics and Scope
- Advantages: Circuit Discovery yields mechanistically structured explanations: selecting edges reveals how multiple objects compose a computation and exposes cross-layer routing patterns that node-wise rankings can miss. This aligns with transformers’ residual-update structure, where heads and FFNs contribute additive edits that can be tracked as directed dependencies. Attribution-based edge scoring also enables scalable screening of large edge sets when exhaustive interventions are infeasible.
- Limitations: Circuits are defined relative to a specific behavior, metric, and contrast (clean vs. corrupted), so results are often objective- and dataset-dependent. Because attribution scores approximate intervention effects, they may miss non-linear interactions, so rankings are best treated as proposals and typically require intervention-based validation on the retained subgraph.
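The following is a node-level sketch of the attribution-based scoring that underlies scalable circuit discovery: each attention block’s intervention effect is approximated as (corrupt minus clean activation) dotted with the metric’s gradient from the clean run. Prompts and model are illustrative, the clean/corrupt pair must tokenize to equal length, and edge-level variants extend the same idea to pairwise dependencies.

```python
# Attribution-patching sketch (node-level approximation of edge scoring).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

clean_in = tok("When Mary and John went out, John gave a drink to", return_tensors="pt")
corrupt_in = tok("When Mary and John went out, Anne gave a drink to", return_tensors="pt")
target_id = tok(" Mary", add_special_tokens=False)["input_ids"][0]

def run(inputs, needs_grad):
    acts, hooks = {}, []
    for l, block in enumerate(model.transformer.h):
        def save(module, inp, out, l=l):
            h = out[0]          # GPT-2 attention modules return a tuple
            if needs_grad:
                h.retain_grad()
            acts[l] = h
        hooks.append(block.attn.register_forward_hook(save))
    ctx = torch.enable_grad() if needs_grad else torch.no_grad()
    with ctx:
        logits = model(**inputs).logits
        F = torch.log_softmax(logits[0, -1], -1)[target_id]
        if needs_grad:
            F.backward()        # populates .grad on the cached activations
    for h in hooks:
        h.remove()
    return acts

clean_acts = run(clean_in, needs_grad=True)
corrupt_acts = run(corrupt_in, needs_grad=False)

for l in range(len(model.transformer.h)):
    score = ((corrupt_acts[l] - clean_acts[l]) * clean_acts[l].grad).sum().item()
    print(f"layer {l:2d} attn node score: {score:+.4f}")
```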
4. Steering Methods
4.1. Amplitude Manipulation
Methodological Formulation
- Ablation or Patching: Here, the object’s activation $h$ is suppressed or replaced, i.e., $h \leftarrow \tilde{h}$. Setting $\tilde{h} = 0$ (zeroing) or $\tilde{h} = \bar{h}$ (mean centering) removes the component’s influence, while setting $\tilde{h}$ to the corresponding activation from a different context (patching) injects information from that context.
- Scaling: Here, the activation strength is adjusted via a scalar coefficient $\alpha$, such that $h \leftarrow \alpha h$. This allows for continuous amplification ($\alpha > 1$) or attenuation ($\alpha < 1$) of a specific feature’s downstream impact. A minimal hook-based sketch of both operations follows this list.
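The sketch below applies both operations to an FFN activation with a forward hook; $\alpha = 0$ recovers zero-ablation. The layer, neuron indices, and $\alpha$ are illustrative and assume the targets were localized beforehand (§3).

```python
# Amplitude Manipulation sketch: scale or zero chosen FFN neurons at inference.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

layer, neuron_idx, alpha = 6, [11, 42, 300], 0.0  # alpha = 0 -> zero-ablation

def scale_neurons(module, inp, out):
    out = out.clone()
    out[..., neuron_idx] *= alpha  # alpha > 1 amplifies, alpha < 1 attenuates
    return out

handle = model.transformer.h[layer].mlp.act.register_forward_hook(scale_neurons)
ids = tok("The movie was", return_tensors="pt")
with torch.no_grad():
    print(tok.decode(model.generate(**ids, max_new_tokens=10)[0]))
handle.remove()  # the intervention is fully reversible
```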
Applicable Objects
Characteristics and Scope
- Advantages: It is an optimization-free and reversible intervention. It allows for “surgical” edits to model behavior (e.g., removing specific biases) by simply masking or scaling activations during inference. This makes it highly flexible and suitable for real-time control.
- Limitations: It relies heavily on the accurate localization of the target components. If the features responsible for a behavior are not perfectly disentangled (i.e., polysemantic), ablating or scaling them may cause unintended side effects or degrade general performance. Furthermore, finding the optimal scaling factor often requires empirical tuning.
4.2. Targeted Optimization
Methodological Formulation
Applicable Objects
Characteristics and Scope
- Advantages: It offers strong precision, controllability, and persistence. The desired behavioral change is directly encoded in a target objective, and localization helps minimize interference with unrelated competencies. Consequently, it is well-suited for targeted factual rewrites, controlled specialization, and safety-preserving adaptation where lasting changes are required.
- Limitations: Its reliability hinges on correct localization and well-specified supervision. If the chosen subset does not capture the causal mechanism, optimization may underachieve the intended target behavior, shift the behavior to other objects, or yield brittle side effects. In practice, success often requires carefully constructed target/preservation data and robust criteria for selecting the localized update region.
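The following sketch shows the general shape of Targeted Optimization, not any specific editor: freeze the network, unfreeze one localized FFN projection (assumed to have been identified by a localizing method from §3), and optimize a target-rewrite loss plus a preservation loss. The layer, prompts, step count, and loss weighting are illustrative; editors such as ROME use closed-form or carefully constrained updates instead.

```python
# Targeted Optimization sketch: gradient updates restricted to one FFN matrix.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

for p in model.parameters():
    p.requires_grad_(False)
edit_layer = 8  # assumed to have been localized via a Section 3 method
weight = model.transformer.h[edit_layer].mlp.c_proj.weight
weight.requires_grad_(True)
opt = torch.optim.Adam([weight], lr=1e-4)

def completion_nll(prompt, completion):
    ids = tok(prompt + completion, return_tensors="pt")["input_ids"]
    n = len(tok(prompt)["input_ids"])  # assumes a clean prompt/completion split
    logits = model(ids).logits
    return torch.nn.functional.cross_entropy(logits[0, n - 1:-1], ids[0, n:])

for step in range(50):
    opt.zero_grad()
    loss = completion_nll("The capital of Atlantis is", " Poseidonia")  # target
    loss = loss + 0.5 * completion_nll("The capital of France is", " Paris")  # preserve
    loss.backward()
    opt.step()
```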
4.3. Vector Arithmetic
Methodological Formulation
Applicable Objects
- Contrastive Activation Means: This method, often referred to as “Activation Addition” or “Mass-Mean Shift,” assumes that a concept can be isolated by comparing the model’s internal states across opposing contexts [162,163,164,165,166,167]. Formally, let $D^{+}$ be a set of prompts eliciting the target behavior and $D^{-}$ be a set eliciting the opposing behavior. The steering vector $v$ is calculated as the difference between the centroids of the residual stream states $h(x)$ for these two sets: $v = \frac{1}{|D^{+}|}\sum_{x \in D^{+}} h(x) - \frac{1}{|D^{-}|}\sum_{x \in D^{-}} h(x)$. By adding $v$ to the residual stream, we shift the model’s current state towards the centroid of the positive behavior.
- SAE Features: SAEs offer a more precise way to derive $v$ by utilizing monosemantic features [131,168,169,170,171,172]. As illustrated in Figure 12, the process involves two steps:
- Feature Identification: First, we collect residual stream states from a positive dataset $D^{+}$ (eliciting the target concept, e.g., “Happiness”) and a negative/neutral dataset $D^{-}$. By passing these states through the SAE encoder, we calculate the differential activation score for each feature $j$: $\Delta_j = \frac{1}{|D^{+}|}\sum_{x \in D^{+}} f_j(x) - \frac{1}{|D^{-}|}\sum_{x \in D^{-}} f_j(x)$, where $f_j(x)$ denotes the $j$-th feature activation for input $x$. Features with high positive $\Delta_j$ constitute the set of “Target Features” $\mathcal{T}$ that specifically encode the desired trait.
- Vector Construction: The steering vector is then synthesized as the weighted sum of these identified features. Let $d_j$ denote the $j$-th feature direction (the $j$-th column of the SAE decoder weights $W_{\text{dec}}$). The steering vector is computed as $v = \sum_{j \in \mathcal{T}} \Delta_j \, d_j$.
Finally, this obtained steering vector $v$ is injected into the model’s residual stream during inference ($h \leftarrow h + \lambda v$). As shown in Figure 12 (c), this enables precise manipulation of specific semantic traits like “Happiness” or “Confusion” to drastically alter generation styles while minimizing interference with unrelated concepts. A minimal end-to-end sketch follows.
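The sketch below implements the contrastive variant (the SAE variant would replace the centroid difference with $\sum_j \Delta_j d_j$): compute $v$ from two small prompt sets and add $\lambda v$ to the residual stream during generation. The prompt sets, layer, and $\lambda$ are illustrative assumptions.

```python
# Vector Arithmetic sketch (Contrastive Activation Means): h <- h + lambda * v.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def centroid(prompts, layer):
    states = []
    for p in prompts:
        with torch.no_grad():
            out = model(**tok(p, return_tensors="pt"), output_hidden_states=True)
        states.append(out.hidden_states[layer][0, -1])
    return torch.stack(states).mean(0)

layer, lam = 6, 4.0
pos = ["I feel wonderful and full of joy.", "What a happy, delightful day!"]
neg = ["I feel miserable and hopeless.", "What a sad, dreadful day."]
v = centroid(pos, layer) - centroid(neg, layer)  # steering vector

def steer(module, inp, out):
    return (out[0] + lam * v,) + out[1:]  # add lambda * v to the residual stream

handle = model.transformer.h[layer].register_forward_hook(steer)
ids = tok("Today I", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()
```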
Characteristics and Scope
- Advantages: It is a lightweight and reversible intervention. Since it typically operates at inference time (for hidden states) or via simple weight addition, it does not require complex optimization or gradient descent during deployment. It allows for flexible control over model behavior by simply adjusting the steering coefficient $\lambda$.
- Limitations: The effectiveness relies on the “Linear Representation Hypothesis.” If the target concept is not encoded linearly or if the steering vector is entangled with other concepts (which is common with “Contrastive Activation Means”), the intervention might introduce unintended side effects.
5. Applications
5.1. Improve Alignment
5.1.1. Safety and Reliability

1) Safety-Critical Component Manipulation
2) Latent Safety and Reliability Representation Steering
5.1.2. Fairness and Bias

1) Gender Bias Localization and Selective Debiasing
2) Distributed Attribute and Cultural Bias Signals
3) Evaluation Bias Engines in Judgment and Framing
5.1.3. Persona and Role

1) Global Persona Modulation via Vectors
2) Persona-Specific Component Editing
3) Psychological Profiling and Diagnosis
5.2. Improve Capability
5.2.1. Multilingualism

1) Language-Specific Component Manipulation
2) Cross-Lingual Representation Steering in Residual Space
5.2.2. Knowledge Management

1) Precise Knowledge Updating
- Localized Parameter Rewriting: A core result is that many factual associations are mediated by localized pathways, often concentrated in mid-layer FFN outputs and neuron activation states. Meng et al. [43] used Causal Attribution to identify carriers responsible for factual recall and applied structured weight edits (on mid-layer FFN weight matrices) to rewrite specific associations, a process referred to as Targeted Optimization, providing a mechanistic alternative to diffuse fine-tuning (a rank-one sketch of such an edit appears after this list). Scaling beyond single edits, Meng et al. [144] extended this paradigm to large edit batches by coordinating updates across multiple layers, demonstrating that persistent rewriting could remain localized while handling substantial edit volume. Subsequent work refined the localization premise: Chen et al. [216] argued that editability was frequently query-conditioned, motivating consistency-aware localization under a broader Query Localization assumption, rather than a fixed set of knowledge neurons. For long-form QA, Chen et al. [27] introduced QRNCA (a form of Causal Attribution), which yielded actionable neuron groups that better tracked query semantics. In multilingual settings, Zhang et al. [150] identified language-agnostic factual neurons via Magnitude Analysis and applied Targeted Optimization on these shared neurons to improve cross-lingual edit consistency. On the backward pass, Katz et al. [217] complemented forward analyses with Vocabulary Projection of backward-pass gradients, offering an orthogonal diagnostic on where learning signals concentrated during updates.
- Activation-Space Editing and Unlearning: When persistent rewrites are undesirable (e.g., for reversible control or safety-motivated removal), activation-level interventions on residual stream states at layer l, or on head/feature activations, provide a practical alternative. Lai et al. [218] jointly localized and edited attention-head computations (intervening on attention head outputs) through gated activation control, instantiating a targeted form of Targeted Optimization. SAE-based approaches decomposed residual stream states into sparse feature activations, enabling feature-level interventions: Muhamed et al. [219] proposed dynamic SAE guardrails that selected and scaled relevant features via Magnitude Analysis to achieve precision unlearning with improved forget–utility trade-offs, while Goyal et al. [131] applied Amplitude Manipulation to steer toxicity-related SAE features (scaling features selected by their activations) to reduce harmful generations with controlled fluency impact.
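To illustrate the locate-then-edit idea in isolation, here is a minimal rank-one FFN edit in the spirit of ROME [43]. This is the plain least-squares version with random stand-in tensors; the actual method estimates the key and value vectors from the model and uses a covariance-weighted update.

```python
# Rank-one FFN edit sketch: choose W_new so W_new @ k == v_new for a subject
# key k, perturbing other directions minimally. Stand-in tensors throughout.
import torch

d_model, d_ffn = 768, 3072
W = torch.randn(d_model, d_ffn) / d_ffn ** 0.5  # stand-in FFN output matrix

k = torch.randn(d_ffn)        # key: FFN activation pattern for the subject
v_new = torch.randn(d_model)  # value: residual direction for the new fact

delta = torch.outer(v_new - W @ k, k) / (k @ k)  # minimal rank-one correction
W_new = W + delta

print(torch.allclose(W_new @ k, v_new, atol=1e-3))  # True: association rewritten
```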
2) Knowledge Retention and Stability
- Conflict Suppression and Mitigation: Failures under retrieval or context injection often arose from attention heads that mediated the integration of parametric memory and external evidence in the residual stream. Jin et al. [220] performed Causal Attribution to localize conflict-mediating heads and applied test-time head suppression/patching, i.e., Amplitude Manipulation over attention head outputs, to rebalance memory vs. context usage. Li et al. [221] further used Magnitude Analysis to identify heads exhibiting superposition effects and applied targeted gating via Targeted Optimization to stabilize behavior under conflicts. Long-context distraction was traced to entrainment-related heads: Niu et al. [136] localized such heads using Causal Attribution and ablated or modulated their outputs, reducing echoing of irrelevant context tokens. Jin et al. [9] further characterized concentrated massive values in computations mediated by the query and key weight matrices (reflected in attention scores, as revealed via Magnitude Analysis), then guided Amplitude Manipulation over corresponding head outputs to maintain contextual reading without disrupting magnitude-structured signals.
- Constraining Continual Adaptation: To reduce catastrophic forgetting, MI localized stability-critical carriers and restricted learning via Targeted Optimization. Zhang et al. [74] applied Gradient Detection to identify a “core linguistic” parameter region and froze it, mitigating forgetting. Zhang et al. [66] further constrained adaptation through coarse-to-fine module selection and soft masking, balancing specialty and versatility. Representation-level interventions were also employed: Wu et al. [222] localized residual stream states and applied lightweight edits to these states with a frozen backbone (a form of Targeted Optimization), improving stability relative to weight-centric updates. Monitoring side effects, Du et al. [88] used Probing over residual stream states and attention heads to detect security-relevant drift and selected safer module update schedules, enabling controlled adaptation.
3) Knowledge Consolidation
5.2.3. Logic and Reasoning

1) Specific Refinement of Numerical and Logical Components
2) Inference Trajectory Steering
3) Stepwise Diagnosis and Correction
5.3. Improve Efficiency
5.3.1. Efficient Training

1) Sparse Fine-tuning
2) Training Dynamic Monitoring
5.3.2. Efficient Inference

1) Selective Computation via Saliency Detection
- Data Level: Researchers have developed token- and KV-cache-level pruning strategies that leverage Magnitude Analysis and Gradient Detection to identify and remove unimportant tokens. TokenSkip [159] uses Magnitude Analysis to identify tokens with minimal contribution to the reasoning process in CoT sequences and selectively skips them, achieving substantial compression with negligible performance degradation. Lei et al. [251] explored explanation-driven token compression for multimodal LLMs, where Gradient Detection is used to map attention patterns to explanation outcomes, enabling the effective pruning of visual tokens during the input stage. For KV-cache-level pruning, FitPrune [253] and ZipCache [21] employed Magnitude Analysis saliency metrics to identify and retain critical KV states. Guo et al. [252] introduced Value-Aware Token Pruning (VATP), which applied Magnitude Analysis to attention scores and the L1 norm of value vectors to identify crucial tokens. Moving beyond token-wise pruning, Circuit Discovery techniques have been applied to identify “Retrieval Heads” that are essential for long-context tasks, enabling non-critical heads to operate with a fixed-length KV cache [155,254,332,333].
- Model Level: MI-guided metrics enable the skipping of entire architectural blocks, such as redundant layers, MoE experts, or neurons, thereby facilitating inference acceleration with minimal impact on model performance. Men et al. [14] introduced “Block Influence” (BI), a similarity metric based on Magnitude Analysis that compares the input and output of each layer; this technique effectively removes layers with minimal contribution to the representation space (a minimal BI-style sketch follows this list). Dynamic bypassing methods such as GateSkip [255] and LayerSkip [13] employ learnable residual gates to skip layers during inference, also guided by Magnitude Analysis. Similarly, HadSkip [257] and SBERT [258] leverage Magnitude Analysis to facilitate effective layer skipping. In MoE architectures, Lu et al. [259] skipped unimportant experts during inference based on Magnitude Analysis of router scores. Su et al. [8] further identified Super Experts by applying Magnitude Analysis to experts’ output activations, showing that these experts are essential for logical reasoning and that pruning them leads to catastrophic performance degradation. Finally, by localizing specialized multilingual neurons [25] and language-specific sub-networks [260] through Magnitude Analysis of their activations, LLMs can activate only the sub-circuits necessary for the specific task at hand.
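Below is a minimal sketch of a Block Influence-style score: one minus the per-position cosine similarity between a layer's input and output hidden states. Layers with scores near zero barely transform the residual stream and are pruning candidates. The model and prompt are illustrative.

```python
# Block Influence-style sketch: per-layer redundancy from hidden-state change.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    hs = model(**inputs, output_hidden_states=True).hidden_states

for l in range(len(hs) - 1):
    cos = F.cosine_similarity(hs[l][0], hs[l + 1][0], dim=-1)  # per position
    print(f"layer {l:2d} block influence: {(1 - cos).mean().item():.4f}")
```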
2) Layer-Specific Adaptive Quantization
6. Challenges and Future Directions
Challenges
Future Directions
7. Conclusions
Limitations
Appendix A Summary of Surveyed Papers
| Paper | Object | Localizing Method | Steering Method | Venue | Year | Link |
|---|---|---|---|---|---|---|
| Safety and Reliability (Improve Alignment) | ||||||
| Zhou et al. | MHA | Causal Attribution | Amplitude Manipulation | ICLR | 2025 | Link |
| Huang et al. | MHA | Circuit Discovery | Targeted Optimization | EMNLP | 2025 | Link |
| Jiang et al. | MHA | Causal Attribution | Targeted Optimization | ArXiv | 2024 | Link |
| Chen et al. | Neuron | Causal Attribution | Amplitude Manipulation | ArXiv | 2025 | Link |
| Suau et al. | Neuron | Magnitude Analysis | Amplitude Manipulation | ICML | 2024 | Link |
| Gao et al. | Neuron | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |
| Zhao et al. | Neuron | Magnitude Analysis | Targeted Optimization | ICLR | 2025 | Link |
| Li et al. | Neuron | Magnitude Analysis | Targeted Optimization | ArXiv | 2025 | Link |
| Templeton et al. | SAE Feature | Magnitude Analysis | Amplitude Manipulation | Blog | 2024 | Link |
| Goyal et al. | SAE Feature | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2025 | Link |
| Yeo et al. | SAE Feature | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2025 | Link |
| Li et al. | SAE Feature | Magnitude Analysis | Vector Arithmetic | ArXiv | 2025 | Link |
| Weng et al. | SAE Feature | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |
| Wu et al. | SAE Feature | Magnitude Analysis | Vector Arithmetic | ICML | 2025 | Link |
| He et al. | SAE Feature | Magnitude Analysis | Vector Arithmetic | ArXiv | 2025 | Link |
| Li et al. | Residual Stream | Causal Attribution | Targeted Optimization | ICLR | 2025 | Link |
| Lee et al. | Residual Stream | Probing | Targeted Optimization | ICML | 2024 | Link |
| Arditi et al. | Residual Stream | Causal Attribution | Vector Arithmetic | NeurIPS | 2024 | Link |
| Zhao et al. | Residual Stream | Causal Attribution | Vector Arithmetic | NeurIPS | 2025 | Link |
| Yin et al. | Residual Stream | Probing | Vector Arithmetic | ArXiv | 2025 | Link |
| Ball et al. | Residual Stream | Causal Attribution | Vector Arithmetic | ArXiv | 2024 | Link |
| Wang et al. | Residual Stream | Causal Attribution | Vector Arithmetic | ICLR | 2025 | Link |
| Wang et al. | Residual Stream | Causal Attribution | Vector Arithmetic | NeurIPS | 2025 | Link |
| Ferreira et al. | Residual Stream | Causal Attribution | Vector Arithmetic | ICML | 2025 | Link |
| Huang et al. | Residual Stream | Causal Attribution | Vector Arithmetic | ICML | 2025 | Link |
| Pan et al. | Residual Stream | Causal Attribution | Vector Arithmetic | ICML | 2025 | Link |
| Chuang et al. | Residual Stream | Vocab Projection | Vector Arithmetic | ICLR | 2024 | Link |
| Chen et al. | Residual Stream | Vocab Projection | Vector Arithmetic | ICML | 2024 | Link |
| Zhang et al. | Residual Stream | Probing | Vector Arithmetic | ACL | 2024 | Link |
| Orgad et al. | Residual Stream | Probing | Vector Arithmetic | ICLR | 2025 | Link |
| Stolfo et al. | Residual Stream | Gradient Detection | Vector Arithmetic | ICLR | 2025 | Link |
| Du et al. | Token Embedding | Gradient Detection | Vector Arithmetic | ArXiv | 2025 | Link |
| Fairness and Bias (Improve Alignment) | ||||||
| Vig et al. | MHA | Causal Attribution | Amplitude Manipulation | NeurIPS | 2020 | Link |
| Chintam et al. | MHA | Causal Attribution | Targeted Optimization | ACLWS | 2023 | Link |
| Wang et al. | MHA | Magnitude Analysis | Amplitude Manipulation | ICLR | 2025 | Link |
| Kim et al. | MHA | Probing | Vector Arithmetic | ICLR | 2025 | Link |
| Dimino et al. | MHA | Magnitude Analysis | - | ICAIF | 2025 | Link |
| Chandna et al. | MHA | Magnitude Analysis | Amplitude Manipulation | TMLR | 2025 | Link |
| Cai et al. | FFN | Causal Attribution | Targeted Optimization | ICIC | 2024 | Link |
| Ahsan et al. | FFN | Causal Attribution | Amplitude Manipulation | EMNLP | 2025 | Link |
| Li and Gao | FFN | Vocab Projection | Targeted Optimization | ACL | 2025 | Link |
| Yu and Ananiadou | Neuron | Circuit Discovery | Targeted Optimization | ArXiv | 2025 | Link |
| Liu et al. | Neuron | Gradient Detection | Amplitude Manipulation | ICLR | 2024 | Link |
| Yu et al. | Residual Stream | Causal Attribution | - | ArXiv | 2025 | Link |
| Guan et al. | Residual Stream | - | Amplitude Manipulation | ICML | 2025 | Link |
| Yu et al. | Residual Stream | Magnitude Analysis | Amplitude Manipulation | ACL | 2025 | Link |
| Raimondi et al. | Residual Stream | Causal Attribution | Amplitude Manipulation | ArXiv | 2025 | Link |
| Persona and Role (Improve Alignment) | ||||||
| Su et al. | Neuron | Causal Attribution | Amplitude Manipulation | EMNLP | 2025 | Link |
| Deng et al. | Neuron | Causal Attribution | Amplitude Manipulation | ICLR | 2025 | Link |
| Lai et al. | Neuron | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2024 | Link |
| Chen et al. | Neuron | Causal Attribution | Targeted Optimization | ICML | 2024 | Link |
| Rimsky et al. | Residual Stream | Causal Attribution | Vector Arithmetic | ACL | 2024 | Link |
| Poterti et al. | Residual Stream | Causal Attribution | Vector Arithmetic | EMNLP | 2025 | Link |
| Chen et al. | Residual Stream | Causal Attribution | Vector Arithmetic | ArXiv | 2025 | Link |
| Handa et al. | Residual Stream | Causal Attribution | Vector Arithmetic | NeurIPS | 2025 | Link |
| Tak et al. | Residual Stream | Probing | Vector Arithmetic | ACL | 2025 | Link |
| Yuan et al. | Residual Stream | Probing | - | ArXiv | 2025 | Link |
| Ju et al. | Residual Stream | Probing | Targeted Optimization | COLM | 2025 | Link |
| Karny et al. | Residual Stream | Causal Attribution | - | ArXiv | 2025 | Link |
| Banayeeanzade et al. | Residual Stream | Causal Attribution | Vector Arithmetic | ArXiv | 2025 | Link |
| Bas and Novak | Residual Stream | Causal Attribution | Vector Arithmetic | ArXiv | 2025 | Link |
| Sun et al. | Residual Stream | Causal Attribution | Vector Arithmetic | EMNLP | 2025 | Link |
| Pai et al. | Residual Stream | Causal Attribution | Vector Arithmetic | ArXiv | 2025 | Link |
| Joshi et al. | Residual Stream | Probing | - | EMNLP | 2024 | Link |
| Ghandeharioun et al. | Residual Stream | Causal Attribution | Vector Arithmetic | NeurIPS | 2024 | Link |
| Multilingualism (Improve Capability) | ||||||
| Xie et al. | Neuron | Magnitude Analysis | Amplitude Manipulation | ACL | 2021 | Link |
| Kojima et al. | Neuron | Magnitude Analysis | Amplitude Manipulation | NAACL | 2024 | Link |
| Tang et al. | Neuron | Magnitude Analysis | Amplitude Manipulation | ACL | 2024 | Link |
| Zhao et al. | Neuron | Magnitude Analysis | Amplitude Manipulation | NeurIPS | 2024 | Link |
| Gurgurov et al. | Neuron | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |
| Liu et al. | Neuron | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2025 | Link |
| Jing et al. | Neuron | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2025 | Link |
| Andrylie et al. | SAE Feature | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |
| Brinkmann et al. | SAE Feature | Magnitude Analysis | Amplitude Manipulation | NAACL | 2025 | Link |
| Libovický et al. | Residual Stream | Probing | - | EMNLP | 2020 | Link |
| Chi et al. | Residual Stream | - | Vector Arithmetic | ACL | 2023 | Link |
| Philippy et al. | Residual Stream | Magnitude Analysis | Vector Arithmetic | ACL | 2023 | Link |
| Wendler et al. | Residual Stream | Vocab Projection | Vector Arithmetic | ACL | 2024 | Link |
| Mousi et al. | Residual Stream | Magnitude Analysis | Vector Arithmetic | ACL | 2024 | Link |
| Hinck et al. | Residual Stream | Probing | Vector Arithmetic | EMNLP | 2024 | Link |
| Zhang et al. | Residual Stream | Magnitude Analysis | Vector Arithmetic | ACL | 2025 | Link |
| Wang et al. | Residual Stream | Vocab Projection | Vector Arithmetic | ACL | 2025 | Link |
| Wu et al. | Residual Stream | Vocab Projection | - | ICLR | 2025 | Link |
| Wang et al. | Residual Stream | Vocab Projection | Vector Arithmetic | EMNLP | 2025 | Link |
| Nie et al. | Residual Stream | Vocab Projection | Vector Arithmetic | EMNLP | 2025 | Link |
| Liu et al. | Residual Stream | Vocab Projection | Vector Arithmetic | EMNLP | 2025 | Link |
| Knowledge Management (Improve Capability) | ||||||
| Meng et al. | FFN | Causal Attribution | Targeted Optimization | NeurIPS | 2022 | Link |
| Meng et al. | FFN | Causal Attribution | Targeted Optimization | ICLR | 2023 | Link |
| Lai et al. | MHA | Magnitude Analysis | Targeted Optimization | ICML | 2025 | Link |
| Li et al. | MHA | Magnitude Analysis | Amplitude Manipulation | ICML | 2025 | Link |
| Jin et al. | MHA | Magnitude Analysis | Amplitude Manipulation | ICML | 2025 | Link |
| Jin et al. | MHA | Causal Attribution | Amplitude Manipulation | ACL | 2024 | Link |
| Lv et al. | MHA | Causal Attribution | Amplitude Manipulation | ArXiv | 2024 | Link |
| Niu et al. | MHA | Causal Attribution | Amplitude Manipulation | ACL | 2025 | Link |
| Zhao et al. | MHA | Probing | Targeted Optimization | EMNLP | 2025 | Link |
| Yadav et al. | FFN & MHA | Magnitude Analysis | Vector Arithmetic | NeurIPS | 2023 | Link |
| Yu and Ananiadou | FFN & MHA | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2024 | Link |
| Zhang et al. | FFN & MHA | Magnitude Analysis | Targeted Optimization | ACL | 2024 | Link |
| Chen et al. | FFN & MHA | Magnitude Analysis | Amplitude Manipulation | ICLR | 2025 | Link |
| Li et al. | FFN & MHA | Magnitude Analysis | Targeted Optimization | AAAI | 2025 | Link |
| Muhamed and Smith | FFN & MHA | Magnitude Analysis | - | ICML | 2025 | Link |
| Yao et al. | FFN & MHA | Circuit Discovery | Amplitude Manipulation | NeurIPS | 2024 | Link |
| Du et al. | FFN & MHA | Probing | Targeted Optimization | ArXiv | 2024 | Link |
| Zhang et al. | FFN & MHA | Gradient Detection | Targeted Optimization | ACL | 2024 | Link |
| Liu et al. | FFN & MHA | Gradient Detection | Vector Arithmetic | ACL | 2025 | Link |
| Yao et al. | FFN & MHA | Magnitude Analysis | Vector Arithmetic | NeurIPS | 2025 | Link |
| Geva et al. | FFN & MHA | Causal Attribution | - | EMNLP | 2023 | Link |
| Zhang et al. | Neuron | Magnitude Analysis | Targeted Optimization | COLING | 2025 | Link |
| Chen et al. | Neuron | Gradient Detection | Amplitude Manipulation | AAAI | 2024 | Link |
| Shi et al. | Neuron | Gradient Detection | Amplitude Manipulation | NeurIPS | 2024 | Link |
| Chen et al. | Neuron | Gradient Detection | Amplitude Manipulation | AAAI | 2025 | Link |
| Kassem et al. | Neuron | - | Amplitude Manipulation | EMNLP | 2025 | Link |
| Muhamed et al. | SAE Feature | Magnitude Analysis | Amplitude Manipulation | ICML | 2025 | Link |
| Goyal et al. | SAE Feature | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2025 | Link |
| Marks et al. | SAE Feature | Circuit Discovery | Amplitude Manipulation | ICLR | 2025 | Link |
| Kang and Choi | Residual Stream | Probing | - | EMNLP | 2023 | Link |
| Katz et al. | Residual Stream | Vocab Projection | Targeted Optimization | EMNLP | 2024 | Link |
| Wu et al. | Residual Stream | Causal Attribution | Targeted Optimization | NeurIPS | 2024 | Link |
| Zhao et al. | Residual Stream | Probing | - | ArXiv | 2024 | Link |
| Ju et al. | Residual Stream | Probing | - | COLING | 2024 | Link |
| Jin et al. | Residual Stream | Probing | - | COLING | 2025 | Link |
| Chen et al. | Residual Stream | Probing | Vector Arithmetic | NeurIPS | 2025 | Link |
| Logic and Reasoning (Improve Capability) | ||||||
| Wu et al. | Token Embedding | Gradient Detection | - | ICML | 2023 | Link |
| You et al. | Token Embedding | Magnitude Analysis | - | EMNLP | 2025 | Link |
| Cywiński et al. | Token Embedding | Causal Attribution | Amplitude Manipulation | Blog | 2025 | Link |
| Cywiński et al. | Token Embedding | Causal Attribution | Amplitude Manipulation | Blog | 2025 | Link |
| Wang et al. | FFN | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |
| Yu and Ananiadou | MHA | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2024 | Link |
| Zhang et al. | MHA | Causal Attribution | Targeted Optimization | ICML | 2024 | Link |
| Yu and Ananiadou | MHA | Causal Attribution | Amplitude Manipulation | EMNLP | 2024 | Link |
| Yu et al. | MHA | Causal Attribution | - | EMNLP | 2025 | Link |
| Stolfo et al. | FFN & MHA | Causal Attribution | - | EMNLP | 2023 | Link |
| Akter et al. | FFN & MHA | Causal Attribution | - | COMPSAC | 2024 | Link |
| Yang et al. | FFN & MHA | Magnitude Analysis | - | ArXiv | 2024 | Link |
| Quirke and Barez | FFN & MHA | Causal Attribution | Amplitude Manipulation | ICLR | 2024 | Link |
| Chen et al. | FFN & MHA | Gradient Detection | Targeted Optimization | ACL | 2025 | Link |
| Hanna et al. | FFN & MHA | Circuit Discovery | - | NeurIPS | 2023 | Link |
| Nikankin et al. | FFN & MHA | Circuit Discovery | - | ICLR | 2025 | Link |
| Galichin et al. | SAE Feature | Magnitude Analysis | Vector Arithmetic | ArXiv | 2025 | Link |
| Pach et al. | SAE Feature | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |
| Troitskii et al. | SAE Feature | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2025 | Link |
| Venhoff et al. | Residual Stream | Causal Attribution | Vector Arithmetic | ICLR | 2025 | Link |
| Højer et al. | Residual Stream | Causal Attribution | Vector Arithmetic | ICLR | 2025 | Link |
| Tang et al. | Residual Stream | Causal Attribution | Vector Arithmetic | ACL | 2025 | Link |
| Hong et al. | Residual Stream | Causal Attribution | Vector Arithmetic | ACL | 2025 | Link |
| Zhang and Viteri | Residual Stream | Causal Attribution | Vector Arithmetic | ICLR | 2025 | Link |
| Liu et al. | Residual Stream | Causal Attribution | Vector Arithmetic | ArXiv | 2025 | Link |
| Sinii et al. | Residual Stream | Causal Attribution | Vector Arithmetic | EMNLP | 2025 | Link |
| Li et al. | Residual Stream | Causal Attribution | Vector Arithmetic | EMNLP | 2025 | Link |
| Ward et al. | Residual Stream | Causal Attribution | Vector Arithmetic | ICML | 2025 | Link |
| Biran et al. | Residual Stream | Probing | - | EMNLP | 2024 | Link |
| Ye et al. | Residual Stream | Probing | - | ICLR | 2025 | Link |
| Sun et al. | Residual Stream | Probing | - | EMNLP | 2025 | Link |
| Wang et al. | Residual Stream | Probing | Vector Arithmetic | AAAI | 2026 | Link |
| Tan et al. | Residual Stream | Vocab Projection | Targeted Optimization | ArXiv | 2025 | Link |
| Efficient Training (Improve Efficiency) | ||||||
| Panigrahi et al. | Neuron | Magnitude Analysis | Targeted Optimization | ICML | 2023 | Link |
| Zhu et al. | Neuron | Gradient Detection | Targeted Optimization | ACL | 2024 | Link |
| Song et al. | Neuron | Gradient Detection | Targeted Optimization | ICML | 2024 | Link |
| Zhang et al. | Neuron | Magnitude Analysis | Targeted Optimization | ACL | 2023 | Link |
| Xu et al. | Neuron | Magnitude Analysis | Targeted Optimization | COLING | 2025 | Link |
| Mondal et al. | Neuron | Magnitude Analysis | Targeted Optimization | ACL | 2025 | Link |
| Gurgurov et al. | Neuron | Magnitude Analysis | Targeted Optimization | AACL | 2025 | Link |
| Zhao et al. | Neuron | Causal Attribution | Targeted Optimization | NeurIPS | 2024 | Link |
| Li et al. | Neuron | Magnitude Analysis | - | ArXiv | 2025 | Link |
| Sergeev and Kotelnikov | MHA | Magnitude Analysis | Targeted Optimization | ICAI | 2025 | Link |
| Olsson et al. | MHA | Magnitude Analysis | - | ArXiv | 2022 | Link |
| Wang et al. | MHA | Magnitude Analysis | - | ArXiv | 2024 | Link |
| Singh et al. | MHA | Magnitude Analysis | - | ICML | 2024 | Link |
| Hoogland et al. | MHA | Magnitude Analysis | - | TMLR | 2025 | Link |
| Minegishi et al. | MHA | Magnitude Analysis | - | ICLR | 2025 | Link |
| Lai et al. | MHA | Magnitude Analysis | Vector Arithmetic | ICML | 2025 | Link |
| Thilak et al. | FFN & MHA | Magnitude Analysis | - | NeurIPS | 2022 | Link |
| Varma et al. | FFN & MHA | Magnitude Analysis | - | ArXiv | 2023 | Link |
| Furuta et al. | FFN & MHA | Magnitude Analysis | - | TMLR | 2024 | Link |
| Nanda et al. | FFN & MHA | Magnitude Analysis | - | ICLR | 2023 | Link |
| Notsawo Jr et al. | FFN & MHA | Magnitude Analysis | - | ArXiv | 2023 | Link |
| Qiye et al. | FFN & MHA | Magnitude Analysis | - | ArXiv | 2024 | Link |
| Liu et al. | FFN & MHA | Magnitude Analysis | - | ICLR | 2023 | Link |
| Wang et al. | FFN & MHA | Magnitude Analysis | - | NeurIPS | 2024 | Link |
| Huang et al. | FFN & MHA | Magnitude Analysis | - | COLM | 2024 | Link |
| Li et al. | FFN & MHA | Circuit Discovery | Targeted Optimization | ArXiv | 2025 | Link |
| Efficient Inference (Improve Efficiency) | ||||||
| Xia et al. | Token Embedding | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2025 | Link |
| Lei et al. | Token Embedding | Gradient Detection | Amplitude Manipulation | ArXiv | 2025 | Link |
| Guo et al. | Token Embedding | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2024 | Link |
| Ye et al. | Token Embedding | Magnitude Analysis | Amplitude Manipulation | AAAI | 2025 | Link |
| He et al. | Token Embedding | Magnitude Analysis | Amplitude Manipulation | NeurIPS | 2024 | Link |
| Cai et al. | Token Embedding | Magnitude Analysis | Amplitude Manipulation | COLM | 2025 | Link |
| Tang et al. | MHA | Circuit Discovery | Amplitude Manipulation | ICLR | 2025 | Link |
| Xiao et al. | MHA | Circuit Discovery | Amplitude Manipulation | ICLR | 2025 | Link |
| Bi et al. | MHA | Magnitude Analysis | - | CVPR | 2025 | Link |
| Su et al. | MHA | Magnitude Analysis | Amplitude Manipulation | IJCAI | 2025 | Link |
| Xiao et al. | MHA | Magnitude Analysis | Amplitude Manipulation | ICLR | 2024 | Link |
| Lu et al. | FFN | Magnitude Analysis | Amplitude Manipulation | ACL | 2024 | Link |
| Su et al. | FFN | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |
| Yu et al. | FFN | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2024 | Link |
| Liu et al. | Neuron | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2024 | Link |
| Tan et al. | Neuron | Magnitude Analysis | - | EMNLP | 2024 | Link |
| Laitenberger et al. | Residual Stream | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |
| Valade | Residual Stream | Probing | Amplitude Manipulation | ArXiv | 2024 | Link |
| Elhoushi et al. | Residual Stream | Probing | Amplitude Manipulation | ACL | 2024 | Link |
| Wang et al. | Residual Stream | Magnitude Analysis | Amplitude Manipulation | EMNLP | 2023 | Link |
| Lawson and Aitchison | Residual Stream | Magnitude Analysis | Amplitude Manipulation | ArXiv | 2025 | Link |
| Men et al. | Residual Stream | Magnitude Analysis | Amplitude Manipulation | ACL | 2025 | Link |
| Dumitru et al. | Residual Stream | Magnitude Analysis | - | ArXiv | 2024 | Link |
| Zhang et al. | Residual Stream | Magnitude Analysis | - | ArXiv | 2025 | Link |
| Xiao et al. | Residual Stream | Magnitude Analysis | - | ArXiv | 2025 | Link |
| Ranjan and Savakis | Residual Stream | Gradient Detection | - | ArXiv | 2025 | Link |
| Zeng et al. | Residual Stream | Vocab Projection | - | ArXiv | 2024 | Link |
| Shelke et al. | Residual Stream | Magnitude Analysis | Amplitude Manipulation | ACL | 2024 | Link |
| Lin et al. | FFN & MHA | Magnitude Analysis | Amplitude Manipulation | MLSys | 2024 | Link |
| Ashkboos et al. | FFN & MHA | Magnitude Analysis | Amplitude Manipulation | NeurIPS | 2025 | Link |
| Su and Yuan | FFN & MHA | Circuit Discovery | - | COLM | 2025 | Link |
| Xiao et al. | FFN & MHA | Magnitude Analysis | Amplitude Manipulation | NeurIPS | 2022 | Link |
| Sun et al. | FFN & MHA | Magnitude Analysis | - | NeurIPS | 2024 | Link |
| An et al. | FFN & MHA | Circuit Discovery | - | ICLR | 2025 | Link |
| Bondarenko et al. | FFN & MHA | Circuit Discovery | - | NeurIPS | 2023 | Link |
References
- Dettmers, T.; Lewis, M.; Belkada, Y.; Zettlemoyer, L. GPT3.int8(): 8-bit Matrix Multiplication for Transformers at Scale. Advances in Neural Information Processing Systems 2022, 35, 30318–30332.
- Tang, T.; Luo, W.; Huang, H.; Zhang, D.; Wang, X.; Zhao, X.; Wei, F.; Wen, J.R. Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Ku, L.W.; Martins, A.; Srikumar, V., Eds., Bangkok, Thailand, 2024; pp. 5701–5715. [CrossRef]
- Galichin, A.; Dontsov, A.; Druzhinina, P.; Razzhigaev, A.; Rogov, O.Y.; Tutubalina, E.; Oseledets, I. I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders. arXiv preprint arXiv:2503.18878 2025.
- Zhang, F.; Liu, Y.; Li, W.; Lv, J.; Wang, X.; Bai, Q. Towards Superior Quantization Accuracy: A Layer-sensitive Approach. arXiv preprint arXiv:2503.06518 2025.
- An, Y.; Zhao, X.; Yu, T.; Tang, M.; Wang, J. Systematic outliers in large language models. arXiv preprint arXiv:2502.06415 2025.
- Su, Z.; Yuan, K. KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs. arXiv preprint arXiv:2508.04257 2025.
- Huben, R.; Cunningham, H.; Smith, L.R.; Ewart, A.; Sharkey, L. Sparse Autoencoders Find Highly Interpretable Features in Language Models. In Proceedings of the Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
- Su, Z.; Li, Q.; Zhang, H.; Qian, Y.; Xie, Y.; Yuan, K. Unveiling super experts in mixture-of-experts large language models. arXiv preprint arXiv:2507.23279 2025.
- Jin, M.; Mei, K.; Xu, W.; Sun, M.; Tang, R.; Du, M.; Liu, Z.; Zhang, Y. Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding. In Proceedings of the Forty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025. OpenReview.net, 2025.
- Bi, J.; Guo, J.; Tang, Y.; Wen, L.B.; Liu, Z.; Wang, B.; Xu, C. Unveiling visual perception in language models: An attention head analysis approach. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 4135–4144.
- Chuang, Y.S.; Xie, Y.; Luo, H.; Kim, Y.; Glass, J.R.; He, P. DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models. In Proceedings of the Twelfth International Conference on Learning Representations, 2024.
- Zhang, S.; Yu, T.; Feng, Y. TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Ku, L.W.; Martins, A.; Srikumar, V., Eds., Bangkok, Thailand, 2024; pp. 8908–8949. [CrossRef]
- Elhoushi, M.; Shrivastava, A.; Liskovich, D.; Hosmer, B.; Wasti, B.; Lai, L.; Mahmoud, A.; Acun, B.; Agarwal, S.; Roman, A.; et al. LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Ku, L.W.; Martins, A.; Srikumar, V., Eds., Bangkok, Thailand, 2024; pp. 12622–12642. [CrossRef]
- Men, X.; Xu, M.; Zhang, Q.; Yuan, Q.; Wang, B.; Lin, H.; Lu, Y.; Han, X.; Chen, W. ShortGPT: Layers in Large Language Models are More Redundant Than You Expect. In Findings of the Association for Computational Linguistics: ACL 2025; Che, W.; Nabende, J.; Shutova, E.; Pilehvar, M.T., Eds., Vienna, Austria, 2025; pp. 20192–20204. [CrossRef]
- Xiao, G.; Lin, J.; Seznec, M.; Wu, H.; Demouth, J.; Han, S. SmoothQuant: Accurate and efficient post-training quantization for large language models. In Proceedings of the International Conference on Machine Learning. PMLR, 2023, pp. 38087–38099.
- Ashkboos, S.; Mohtashami, A.; Croci, M.L.; Li, B.; Cameron, P.; Jaggi, M.; Alistarh, D.; Hoefler, T.; Hensman, J. Quarot: Outlier-free 4-bit inference in rotated llms. Advances in Neural Information Processing Systems 2024, 37, 100213–100240.
- Yu, M.; Wang, D.; Shan, Q.; Reed, C.J.; Wan, A. The super weight in large language models. arXiv preprint arXiv:2411.07191 2024.
- Cai, Z.; Zhang, Y.; Gao, B.; Liu, Y.; Li, Y.; Liu, T.; Lu, K.; Xiong, W.; Dong, Y.; Hu, J.; et al. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069 2024.
- Xiong, J.; Fan, L.; Shen, H.; Su, Z.; Yang, M.; Kong, L.; Wong, N. DoPE: Denoising Rotary Position Embedding. arXiv preprint arXiv:2511.09146 2025.
- Xiong, J.; Chen, Q.; Ye, F.; Wan, Z.; Zheng, C.; Zhao, C.; Shen, H.; Li, A.H.; Tao, C.; Tan, H.; et al. ATTS: Asynchronous Test-Time Scaling via Conformal Prediction. arXiv preprint arXiv:2509.15148 2025.
- He, Y.; Zhang, L.; Wu, W.; Liu, J.; Zhou, H.; Zhuang, B. Zipcache: Accurate and efficient kv cache quantization with salient token identification. Advances in Neural Information Processing Systems 2024, 37, 68287–68307.
- Su, Z.; Chen, Z.; Shen, W.; Wei, H.; Li, L.; Yu, H.; Yuan, K. Rotatekv: Accurate and robust 2-bit kv cache quantization for llms via outlier-aware adaptive rotations. arXiv preprint arXiv:2501.16383 2025.
- Yuan, J.; Gao, H.; Dai, D.; Luo, J.; Zhao, L.; Zhang, Z.; Xie, Z.; Wei, Y.; Wang, L.; Xiao, Z.; et al. Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 23078–23097.
- Lai, W.; Hangya, V.; Fraser, A. Style-Specific Neurons for Steering LLMs in Text Style Transfer. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing; Al-Onaizan, Y.; Bansal, M.; Chen, Y.N., Eds., Miami, Florida, USA, 2024; pp. 13427–13443. [CrossRef]
- Liu, W.; Xu, Y.; Xu, H.; Chen, J.; Hu, X.; Wu, J. Unraveling Babel: Exploring Multilingual Activation Patterns within Large Language Models. arXiv preprint, 2024.
- Chen, R.; Hu, T.; Feng, Y.; Liu, Z. Learnable Privacy Neurons Localization in Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); Ku, L.W.; Martins, A.; Srikumar, V., Eds., Bangkok, Thailand, 2024; pp. 256–264. [CrossRef]
- Chen, L.; Dejl, A.; Toni, F. Identifying Query-Relevant Neurons in Large Language Models for Long-Form Texts. In Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25 - March 4, 2025, Philadelphia, PA, USA; Walsh, T.; Shah, J.; Kolter, Z., Eds. AAAI Press, 2025, pp. 23595–23604. [CrossRef]
- Wang, S.; Lei, Z.; Tan, Z.; Ding, J.; Zhao, X.; Dong, Y.; Wu, G.; Chen, T.; Chen, C.; Zhang, A.; et al. BrainMAP: Learning Multiple Activation Pathways in Brain Networks. In Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25 - March 4, 2025, Philadelphia, PA, USA; Walsh, T.; Shah, J.; Kolter, Z., Eds. AAAI Press, 2025, pp. 14432–14440. [CrossRef]
- Andrylie, L.M.; Rahmanisa, I.; Ihsani, M.K.; Wicaksono, A.F.; Wibowo, H.A.; Aji, A.F. Sparse Autoencoders Can Capture Language-Specific Concepts Across Diverse Languages, 2025, [arXiv:cs.CL/2507.11230].
- Gurgurov, D.; Trinley, K.; Ghussin, Y.A.; Baeumel, T.; van Genabith, J.; Ostermann, S. Language Arithmetics: Towards Systematic Language Neuron Identification and Manipulation, 2025, [arXiv:cs.CL/2507.22608].
- Xiao, G.; Tian, Y.; Chen, B.; Han, S.; Lewis, M. Efficient Streaming Language Models with Attention Sinks. arXiv 2023.
- Cancedda, N. Spectral Filters, Dark Signals, and Attention Sinks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 4792–4808.
- Singh, A.K.; Moskovitz, T.; Hill, F.; Chan, S.C.; Saxe, A.M. What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation. In Proceedings of the 41st International Conference on Machine Learning, 2024, pp. 45637–45662.
- Wang, M.; Yu, R.; Wu, L.; et al. How Transformers Implement Induction Heads: Approximation and Optimization Analysis. arXiv preprint, 2024.
- Zhou, Z.; Yu, H.; Zhang, X.; Xu, R.; Huang, F.; Wang, K.; Liu, Y.; Fang, J.; Li, Y. On the Role of Attention Heads in Large Language Model Safety, 2025, [arXiv:cs.CL/2410.13708].
- Sergeev, A.; Kotelnikov, E. Optimizing Multimodal Language Models through Attention-based Interpretability, 2025, [arXiv:cs.CL/2511.23375].
- Sun, S.; Baek, S.Y.; Kim, J.H. Personality Vector: Modulating Personality of Large Language Models by Model Merging. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing; Christodoulopoulos, C.; Chakraborty, T.; Rose, C.; Peng, V., Eds., Suzhou, China, 2025; pp. 24667–24688. [CrossRef]
- Bas, T.; Novak, K. Steering Latent Traits, Not Learned Facts: An Empirical Study of Activation Control Limits. arXiv preprint arXiv:2511.18284 2025.
- Dumitru, R.G.; Yadav, V.; Maheshwary, R.; Clotan, P.I.; Madhusudhan, S.T.; Surdeanu, M. Layer-wise quantization: A pragmatic and effective method for quantizing llms beyond integer bit-levels. arXiv preprint arXiv:2406.17415 2024.
- Tan, Z.; Dong, D.; Zhao, X.; Peng, J.; Cheng, Y.; Chen, T. Dlo: Dynamic layer operation for efficient vertical scaling of llms. arXiv preprint arXiv:2407.11030 2024.
- Lawson, T.; Aitchison, L. Learning to Skip the Middle Layers of Transformers, 2025, [arXiv:cs.LG/2506.21103].
- Vig, J.; Gehrmann, S.; Belinkov, Y.; Qian, S.; Nevo, D.; Singer, Y.; Shieber, S. Investigating Gender Bias in Language Models Using Causal Mediation Analysis. In Advances in Neural Information Processing Systems; Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.; Lin, H., Eds. Curran Associates, Inc., 2020, Vol. 33, pp. 12388–12401.
- Meng, K.; Bau, D.; Andonian, A.; Belinkov, Y. Locating and Editing Factual Associations in GPT. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022; Koyejo, S.; Mohamed, S.; Agarwal, A.; Belgrave, D.; Cho, K.; Oh, A., Eds., 2022.
- Zhang, F.; Nanda, N. Towards best practices of activation patching in language models: Metrics and methods. arXiv preprint arXiv:2309.16042 2023.
- Stolfo, A.; Belinkov, Y.; Sachan, M. A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing; Bouamor, H.; Pino, J.; Bali, K., Eds., Singapore, 2023; pp. 7035–7052. [CrossRef]
- Yu, Z.; Ananiadou, S. How do Large Language Models Learn In-Context? Query and Key Matrices of In-Context Heads are Two Towers for Metric Learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024; Al-Onaizan, Y.; Bansal, M.; Chen, Y., Eds. Association for Computational Linguistics, 2024, pp. 3281–3292. [CrossRef]
- Geiger, A.; Ibeling, D.; Zur, A.; Chaudhary, M.; Chauhan, S.; Huang, J.; Arora, A.; Wu, Z.; Goodman, N.; Potts, C.; et al. Causal abstraction: A theoretical foundation for mechanistic interpretability. Journal of Machine Learning Research 2025, 26, 1–64.
- Ferreira, P.; Aziz, W.; Titov, I. Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations. In Proceedings of the Workshop on Actionable Interpretability at ICML 2025, 2025, [arXiv:cs.CL/2504.05294].
- Yeo, W.J.; Satapathy, R.; Cambria, E. Towards faithful natural language explanations: A study using activation patching in large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 10436–10458.
- Ravindran, S.K. Adversarial activation patching: A framework for detecting and mitigating emergent deception in safety-aligned transformers. arXiv preprint arXiv:2507.09406 2025.
- Yu, H.; Jeong, S.; Pawar, S.; Shin, J.; Jin, J.; Myung, J.; Oh, A.; Augenstein, I. Entangled in Representations: Mechanistic Investigation of Cultural Biases in Large Language Models, 2025, [arXiv:cs/2508.08879]. [CrossRef]
- Wang, K.R.; Variengien, A.; Conmy, A.; Shlegeris, B.; Steinhardt, J. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small. In Proceedings of the Eleventh International Conference on Learning Representations, 2023.
- Geva, M.; Bastings, J.; Filippova, K.; Globerson, A. Dissecting Recall of Factual Associations in Auto-Regressive Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing; Bouamor, H.; Pino, J.; Bali, K., Eds., Singapore, 2023; pp. 12216–12235. [CrossRef]
- Yu, Z.; Ananiadou, S. Interpreting Arithmetic Mechanism in Large Language Models through Comparative Neuron Analysis. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing; Al-Onaizan, Y.; Bansal, M.; Chen, Y.N., Eds., Miami, Florida, USA, 2024; pp. 3293–3306. [CrossRef]
- Li, J.; Chen, X.; Hovy, E.H.; Jurafsky, D. Visualizing and Understanding Neural Models in NLP. In Proceedings of NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016; Knight, K.; Nenkova, A.; Rambow, O., Eds. The Association for Computational Linguistics, 2016, pp. 681–691. [CrossRef]
- Sundararajan, M.; Taly, A.; Yan, Q. Axiomatic Attribution for Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017; Precup, D.; Teh, Y.W., Eds. PMLR, 2017, Vol. 70, Proceedings of Machine Learning Research, pp. 3319–3328.
- Smilkov, D.; Thorat, N.; Kim, B.; Viégas, F.B.; Wattenberg, M. SmoothGrad: removing noise by adding noise. CoRR 2017, abs/1706.03825, [1706.03825].
- Shrikumar, A.; Greenside, P.; Kundaje, A. Learning Important Features Through Propagating Activation Differences. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017, Vol. 70, Proceedings of Machine Learning Research.
- Enguehard, J. Sequential Integrated Gradients: a simple but effective method for explaining language models. In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023; Rogers, A.; Boyd-Graber, J.L.; Okazaki, N., Eds. Association for Computational Linguistics, 2023, pp. 7555–7565. [CrossRef]
- Wu, S.; Shen, E.M.; Badrinath, C.; Ma, J.; Lakkaraju, H. Analyzing chain-of-thought prompting in large language models via gradient-based feature attributions. arXiv preprint arXiv:2307.13339 2023.
- Hou, E.M.; Castañón, G.D. Decoding Layer Saliency in Language Transformers. In Proceedings of the International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA; Krause, A.; Brunskill, E.; Cho, K.; Engelhardt, B.; Sabato, S.; Scarlett, J., Eds. PMLR, 2023, Vol. 202, Proceedings of Machine Learning Research, pp. 13285–13308.
- Tao, Y.; Tang, Y.; Wang, Y.; Zhu, M.; Hu, H.; Wang, Y. Saliency-driven Dynamic Token Pruning for Large Language Models. CoRR 2025, abs/2504.04514, [2504.04514]. [CrossRef]
- Nguyen, D.; Prasad, A.; Stengel-Eskin, E.; Bansal, M. GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs. CoRR 2025, abs/2507.18043, [2507.18043]. [CrossRef]
- Dai, D.; Dong, L.; Hao, Y.; Sui, Z.; Chang, B.; Wei, F. Knowledge Neurons in Pretrained Transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022; Muresan, S.; Nakov, P.; Villavicencio, A., Eds. Association for Computational Linguistics, 2022, pp. 8493–8502. [CrossRef]
- Shi, D.; Jin, R.; Shen, T.; Dong, W.; Wu, X.; Xiong, D. IRCAN: Mitigating Knowledge Conflicts in LLM Generation via Identifying and Reweighting Context-Aware Neurons. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024; Globerson, A.; Mackey, L.; Belgrave, D.; Fan, A.; Paquet, U.; Tomczak, J.M.; Zhang, C., Eds., 2024.
- Zhang, H.; Wu, Y.; Li, D.; Yang, S.; Zhao, R.; Jiang, Y.; Tan, F. Balancing Speciality and Versatility: a Coarse to Fine Framework for Supervised Fine-tuning Large Language Model. In Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024; Ku, L.; Martins, A.; Srikumar, V., Eds. Association for Computational Linguistics, 2024, pp. 7467–7509. [CrossRef]
- Zhang, H.; Liu, Z.; Huang, S.; Shang, C.; Zhan, B.; Jiang, Y. Improving low-resource knowledge tracing tasks by supervised pre-training and importance mechanism fine-tuning. arXiv preprint arXiv:2403.06725 2024.
- Li, M.; Li, Y.; Zhou, T. What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Che, W.; Nabende, J.; Shutova, E.; Pilehvar, M.T., Eds., Vienna, Austria, 2025; pp. 32017–32154. [CrossRef]
- Li, M.; Li, Y.; Li, Z.; Zhou, T. How Instruction and Reasoning Data shape Post-Training: Data Quality through the Lens of Layer-wise Gradients, 2025, [arXiv:cs.LG/2504.10766].
- Jafari, F.R.; Eberle, O.; Khakzar, A.; Nanda, N. RelP: Faithful and Efficient Circuit Discovery via Relevance Patching. CoRR 2025, abs/2508.21258, [2508.21258]. [CrossRef]
- Azarkhalili, B.; Libbrecht, M.W. Generalized Attention Flow: Feature Attribution for Transformer Models via Maximum Flow. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025; Che, W.; Nabende, J.; Shutova, E.; Pilehvar, M.T., Eds. Association for Computational Linguistics, 2025, pp. 19954–19974.
- Liu, S.; Wu, H.; He, B.; Han, X.; Yuan, M.; Song, L. Sens-Merging: Sensitivity-Guided Parameter Balancing for Merging Large Language Models. In Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 - August 1, 2025; Che, W.; Nabende, J.; Shutova, E.; Pilehvar, M.T., Eds. Association for Computational Linguistics, 2025, pp. 19243–19255.
- Li, H.; Zhang, X.; Liu, X.; Gong, Y.; Wang, Y.; Chen, Q.; Cheng, P. Enhancing Large Language Model Performance with Gradient-Based Parameter Selection. In Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25 - March 4, 2025, Philadelphia, PA, USA; Walsh, T.; Shah, J.; Kolter, Z., Eds. AAAI Press, 2025, pp. 24431–24439. [CrossRef]
- Zhang, Z.; Zhao, J.; Zhang, Q.; Gui, T.; Huang, X. Unveiling Linguistic Regions in Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024; Ku, L.; Martins, A.; Srikumar, V., Eds. Association for Computational Linguistics, 2024, pp. 6228–6247. [CrossRef]
- Li, G.; Xi, Z.; Zhang, Z.; Hong, B.; Gui, T.; Zhang, Q.; Huang, X. LoRACoE: Improving Large Language Model via Composition-based LoRA Expert. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing; Christodoulopoulos, C.; Chakraborty, T.; Rose, C.; Peng, V., Eds., Suzhou, China, 2025; pp. 31290–31304. [CrossRef]
- Wang, Y.; Zhang, T.; Guo, X.; Shen, Z. Gradient based Feature Attribution in Explainable AI: A Technical Review. CoRR 2024, abs/2403.10415, [2403.10415]. [CrossRef]
- Wang, K.; Variengien, A.; Conmy, A.; Shlegeris, B.; Steinhardt, J. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small. arXiv preprint arXiv:2211.00593 2022.
- Yin, K.; Neubig, G. Interpreting Language Models with Contrastive Explanations. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing; Goldberg, Y.; Kozareva, Z.; Zhang, Y., Eds., Abu Dhabi, United Arab Emirates, 2022; pp. 184–198. [CrossRef]
- Alain, G.; Bengio, Y. Understanding intermediate layers using linear classifier probes. In Proceedings of the 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Workshop Track Proceedings. OpenReview.net, 2017.
- Belinkov, Y. Probing Classifiers: Promises, Shortcomings, and Advances. Comput. Linguistics 2022, 48, 207–219. [CrossRef]
- Conneau, A.; Kruszewski, G.; Lample, G.; Barrault, L.; Baroni, M. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers; Gurevych, I.; Miyao, Y., Eds. Association for Computational Linguistics, 2018, pp. 2126–2136. [CrossRef]
- Tenney, I.; Das, D.; Pavlick, E. BERT Rediscovers the Classical NLP Pipeline. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers; Korhonen, A.; Traum, D.R.; Màrquez, L., Eds. Association for Computational Linguistics, 2019, pp. 4593–4601. [CrossRef]
- Ravichander, A.; Belinkov, Y.; Hovy, E.H. Probing the Probing Paradigm: Does Probing Accuracy Entail Task Relevance? In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Online, April 19 - 23, 2021; Merlo, P.; Tiedemann, J.; Tsarfaty, R., Eds. Association for Computational Linguistics, 2021, pp. 3363–3377. [CrossRef]
- Ju, T.; Sun, W.; Du, W.; Yuan, X.; Ren, Z.; Liu, G. How Large Language Models Encode Context Knowledge? A Layer-Wise Probing Study. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC/COLING 2024, 20-25 May, 2024, Torino, Italy; Calzolari, N.; Kan, M.; Hoste, V.; Lenci, A.; Sakti, S.; Xue, N., Eds. ELRA and ICCL, 2024, pp. 8235–8246.
- Zhao, Y.; Du, X.; Hong, G.; Gema, A.P.; Devoto, A.; Wang, H.; He, X.; Wong, K.; Minervini, P. Analysing the Residual Stream of Language Models Under Knowledge Conflicts. CoRR 2024, abs/2410.16090, [2410.16090]. [CrossRef]
- Orgad, H.; Toker, M.; Gekhman, Z.; Reichart, R.; Szpektor, I.; Kotek, H.; Belinkov, Y. LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations. In Proceedings of the Thirteenth International Conference on Learning Representations, 2025.
- You, W.; Xue, A.; Havaldar, S.; Rao, D.; Jin, H.; Callison-Burch, C.; Wong, E. Probabilistic Soundness Guarantees in LLM Reasoning Chains. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China, 2025; pp. 7517–7536. [CrossRef]
- Du, Y.; Zhao, S.; Cao, J.; Ma, M.; Zhao, D.; Fan, F.; Liu, T.; Qin, B. Towards Secure Tuning: Mitigating Security Risks Arising from Benign Instruction Fine-Tuning. CoRR 2024, abs/2410.04524, [2410.04524]. [CrossRef]
- Zhao, D.; Liu, X.; Feng, X.; Wang, H.; Qin, B. Probing and Boosting Large Language Models Capabilities via Attention Heads. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing; Christodoulopoulos, C.; Chakraborty, T.; Rose, C.; Peng, V., Eds., Suzhou, China, 2025; pp. 28518–28532. [CrossRef]
- Kim, J.; Evans, J.; Schein, A. Linear Representations of Political Perspective Emerge in Large Language Models. In Proceedings of the Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025.
- Kantamneni, S.; Engels, J.; Rajamanoharan, S.; Tegmark, M.; Nanda, N. Are Sparse Autoencoders Useful? A Case Study in Sparse Probing. In Proceedings of the Forty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025. OpenReview.net, 2025.
- Chanin, D.; Wilken-Smith, J.; Dulka, T.; Bhatnagar, H.; Bloom, J. A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders. CoRR 2024, abs/2409.14507, [2409.14507]. [CrossRef]
- nostalgebraist. Interpreting GPT: the Logit Lens. LessWrong, 2020.
- Geva, M.; Schuster, R.; Berant, J.; Levy, O. Transformer Feed-Forward Layers Are Key-Value Memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing; Moens, M.F.; Huang, X.; Specia, L.; Yih, S.W.t., Eds., Online and Punta Cana, Dominican Republic, 2021; pp. 5484–5495. [CrossRef]
- Belrose, N.; Furman, Z.; Smith, L.; Halawi, D.; Ostrovsky, I.; McKinney, L.; Biderman, S.; Steinhardt, J. Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112 2023.
- Jiang, C.; Qi, B.; Hong, X.; Fu, D.; Cheng, Y.; Meng, F.; Yu, M.; Zhou, B.; Zhou, J. On Large Language Models’ Hallucination with Regard to Known Facts. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024, pp. 1041–1053.
- Jiang, N.; Kachinthaya, A.; Petryk, S.; Gandelsman, Y. Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations, 2025, [arXiv:cs.CV/2410.02762].
- Wendler, C.; Veselovsky, V.; Monea, G.; West, R. Do Llamas Work in English? On the Latent Language of Multilingual Transformers. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Ku, L.W.; Martins, A.; Srikumar, V., Eds., Bangkok, Thailand, 2024; pp. 15366–15394. [CrossRef]
- Kargaran, A.H.; Liu, Y.; Yvon, F.; Schütze, H. How Programming Concepts and Neurons Are Shared in Code Language Models. arXiv preprint arXiv:2506.01074 2025.
- Phukan, A.; Somasundaram, S.; Saxena, A.; Goswami, K.; Srinivasan, B.V. Peering into the Mind of Language Models: An Approach for Attribution in Contextual Question Answering. In Findings of the Association for Computational Linguistics: ACL 2024; Ku, L.W.; Martins, A.; Srikumar, V., Eds., Bangkok, Thailand, 2024; pp. 11481–11495. [CrossRef]
- Phukan, A.; Divyansh; Morj, H.K.; Vaishnavi; Saxena, A.; Goswami, K. Beyond Logit Lens: Contextual Embeddings for Robust Hallucination Detection & Grounding in VLMs. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers); Chiruzzo, L.; Ritter, A.; Wang, L., Eds., Albuquerque, New Mexico, 2025; pp. 9661–9675. [CrossRef]
- Yugeswardeenoo, D.; Nukala, H.; Blondin, C.; O’Brien, S.; Sharma, V.; Zhu, K. Interpreting the Latent Structure of Operator Precedence in Language Models. In Proceedings of the First Workshop on the Interplay of Model Behavior and Model Internals, 2025.
- Sakarvadia, M.; Khan, A.; Ajith, A.; Grzenda, D.; Hudson, N.; Bauer, A.; Chard, K.; Foster, I. Attention lens: A tool for mechanistically interpreting the attention head information retrieval mechanism. arXiv preprint arXiv:2310.16270 2023.
- Yu, Z.; Ananiadou, S. Understanding multimodal llms: the mechanistic interpretability of llava in visual question answering. arXiv preprint arXiv:2411.10950 2024.
- Jiang, Z.; Chen, J.; Zhu, B.; Luo, T.; Shen, Y.; Yang, X. Devils in middle layers of large vision-language models: Interpreting, detecting and mitigating object hallucinations via attention lens. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 25004–25014.
- Kim, J.; Kang, S.; Park, J.; Kim, J.; Hwang, S.J. Interpreting Attention Heads for Image-to-Text Information Flow in Large Vision–Language Models. In Proceedings of the Mechanistic Interpretability Workshop at NeurIPS 2025, 2025.
- Wang, Z. Logitlens4llms: Extending logit lens analysis to modern large language models. arXiv preprint arXiv:2503.11667 2025.
- Huo, J.; Yan, Y.; Hu, B.; Yue, Y.; Hu, X. MMNeuron: Discovering Neuron-Level Domain-Specific Interpretation in Multimodal Large Language Model. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 6801–6816.
- Yu, Z.; Ananiadou, S. Understanding and Mitigating Gender Bias in LLMs via Interpretable Neuron Editing, 2025, [arXiv:cs/2501.14457]. [CrossRef]
- Shao, J.; Lu, Y.; Yang, J. Benford’s Curse: Tracing Digit Bias to Numerical Hallucination in LLMs. arXiv preprint arXiv:2506.01734 2025.
- Arad, D.; Mueller, A.; Belinkov, Y. SAEs Are Good for Steering–If You Select the Right Features. arXiv preprint arXiv:2505.20063 2025.
- Dreyer, M.; Hufe, L.; Berend, J.; Wiegand, T.; Lapuschkin, S.; Samek, W. From What to How: Attributing CLIP’s Latent Components Reveals Unexpected Semantic Reliance. arXiv preprint arXiv:2505.20229 2025.
- Muhamed, A.; Diab, M.; Smith, V. Decoding dark matter: Specialized sparse autoencoders for interpreting rare concepts in foundation models. In Findings of the Association for Computational Linguistics: NAACL 2025, 2025, pp. 1604–1635.
- Gur-Arieh, Y.; Mayan, R.; Agassy, C.; Geiger, A.; Geva, M. Enhancing automated interpretability with output-centric feature descriptions. arXiv preprint arXiv:2501.08319 2025.
- Shu, D.; Wu, X.; Zhao, H.; Rai, D.; Yao, Z.; Liu, N.; Du, M. A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2025; Christodoulopoulos, C.; Chakraborty, T.; Rose, C.; Peng, V., Eds., Suzhou, China, 2025; pp. 1690–1712. [CrossRef]
- Elhage, N.; Nanda, N.; Olsson, C.; Henighan, T.; Joseph, N.; Mann, B.; Askell, A.; Bai, Y.; Chen, A.; Conerly, T.; et al. A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread 2021. https://transformer-circuits.pub/2021/framework/index.html.
- Olsson, C.; Elhage, N.; Nanda, N.; Joseph, N.; DasSarma, N.; Henighan, T.; Mann, B.; Askell, A.; Bai, Y.; Chen, A.; et al. In-context Learning and Induction Heads, 2022, [arXiv:cs.LG/2209.11895].
- Hanna, M.; Liu, O.; Variengien, A. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023; Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; Levine, S., Eds., 2023.
- Yao, Y.; Zhang, N.; Xi, Z.; Wang, M.; Xu, Z.; Deng, S.; Chen, H. Knowledge Circuits in Pretrained Transformers. In Advances in Neural Information Processing Systems; Globerson, A.; Mackey, L.; Belgrave, D.; Fan, A.; Paquet, U.; Tomczak, J.; Zhang, C., Eds. Curran Associates, Inc., 2024, Vol. 37, pp. 118571–118602. [CrossRef]
- Goldowsky-Dill, N.; MacLeod, C.; Sato, L.; Arora, A. Localizing model behavior with path patching. arXiv preprint arXiv:2304.05969 2023.
- Conmy, A.; Mavor-Parker, A.; Lynch, A.; Heimersheim, S.; Garriga-Alonso, A. Towards Automated Circuit Discovery for Mechanistic Interpretability. In Advances in Neural Information Processing Systems; Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; Levine, S., Eds. Curran Associates, Inc., 2023, Vol. 36, pp. 16318–16352.
- Syed, A.; Rager, C.; Conmy, A. Attribution Patching Outperforms Automated Circuit Discovery. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP; Belinkov, Y.; Kim, N.; Jumelet, J.; Mohebbi, H.; Mueller, A.; Chen, H., Eds., Miami, Florida, US, 2024; pp. 407–416. [CrossRef]
- Hanna, M.; Pezzelle, S.; Belinkov, Y. Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms. In Proceedings of the First Conference on Language Modeling, 2024.
- Huang, H.; Yan, Y.; Huo, J.; Zou, X.; Li, X.; Wang, K.; Hu, X. Pierce the Mists, Greet the Sky: Decipher Knowledge Overshadowing via Knowledge Circuit Analysis. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing; Christodoulopoulos, C.; Chakraborty, T.; Rose, C.; Peng, V., Eds., Suzhou, China, 2025; pp. 15471–15490. [CrossRef]
- Haklay, T.; Orgad, H.; Bau, D.; Mueller, A.; Belinkov, Y. Position-aware Automatic Circuit Discovery. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025; Che, W.; Nabende, J.; Shutova, E.; Pilehvar, M.T., Eds. Association for Computational Linguistics, 2025, pp. 2792–2817.
- Mueller, A.; Geiger, A.; Wiegreffe, S.; Arad, D.; Arcuschin, I.; Belfki, A.; Chan, Y.S.; Fiotto-Kaufman, J.F.; Haklay, T.; Hanna, M.; et al. MIB: A Mechanistic Interpretability Benchmark. In Proceedings of the Forty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025. OpenReview.net, 2025.
- Bricken, T.; Templeton, A.; Batson, J.; Chen, B.; Jermyn, A.; Conerly, T.; Turner, N.; Anil, C.; Denison, C.; Askell, A.; et al. Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html.
- Ameisen, E.; Lindsey, J.; Pearce, A.; Gurnee, W.; Turner, N.L.; Chen, B.; Citro, C.; Abrahams, D.; Carter, S.; Hosmer, B.; et al. Circuit Tracing: Revealing Computational Graphs in Language Models. Transformer Circuits Thread 2025.
- Hanna, M.; Piotrowski, M.; Lindsey, J.; Ameisen, E. Circuit-Tracer: A New Library for Finding Feature Circuits. In Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP; Belinkov, Y.; Mueller, A.; Kim, N.; Mohebbi, H.; Chen, H.; Arad, D.; Sarti, G., Eds., Suzhou, China, 2025; pp. 239–249. [CrossRef]
- Nie, E.; Schmid, H.; Schuetze, H. Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2025; Christodoulopoulos, C.; Chakraborty, T.; Rose, C.; Peng, V., Eds., Suzhou, China, 2025; pp. 690–706. [CrossRef]
- Goyal, A.; Rathi, V.; Yeh, W.; Wang, Y.; Chen, Y.; Sundaram, H. Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing; Christodoulopoulos, C.; Chakraborty, T.; Rose, C.; Peng, V., Eds., Suzhou, China, 2025; pp. 12702–12720. [CrossRef]
- Yeo, W.J.; Prakash, N.; Neo, C.; Satapathy, R.; Lee, R.K.W.; Cambria, E. Understanding Refusal in Language Models with Sparse Autoencoders. In Findings of the Association for Computational Linguistics: EMNLP 2025; Christodoulopoulos, C.; Chakraborty, T.; Rose, C.; Peng, V., Eds., Suzhou, China, 2025; pp. 6377–6399. [CrossRef]
- Liu, Y.; Liu, Y.; Chen, X.; Chen, P.Y.; Zan, D.; Kan, M.Y.; Ho, T.Y. The Devil Is in the Neurons: Interpreting and Mitigating Social Biases in Pre-trained Language Models, 2024, [arXiv:cs/2406.10130]. [CrossRef]
- Chandna, B.; Bashir, Z.; Sen, P. Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective, 2025, [arXiv:cs/2506.05166]. [CrossRef]
- Zhou, Z.; Yu, H.; Zhang, X.; Xu, R.; Huang, F.; Wang, K.; Liu, Y.; Fang, J.; Li, Y. On the Role of Attention Heads in Large Language Model Safety. In Proceedings of the Thirteenth International Conference on Learning Representations, 2025.
- Niu, J.; Yuan, X.; Wang, T.; Saghir, H.; Abdi, A.H. Llama See, Llama Do: A Mechanistic Perspective on Contextual Entrainment and Distraction in LLMs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Che, W.; Nabende, J.; Shutova, E.; Pilehvar, M.T., Eds., Vienna, Austria, 2025; pp. 16218–16239. [CrossRef]
- Ahsan, H.; Sharma, A.S.; Amir, S.; Bau, D.; Wallace, B.C. Elucidating Mechanisms of Demographic Bias in LLMs for Healthcare, 2025, [arXiv:cs/2502.13319]. [CrossRef]
- Raimondi, B.; Dalbagno, D.; Gabbrielli, M. Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability, 2025, [arXiv:cs/2510.12229]. [CrossRef]
- Gao, C.; Chen, H.; Xiao, C.; Chen, Z.; Liu, Z.; Sun, M. H-Neurons: On the Existence, Impact, and Origin of Hallucination-Associated Neurons in LLMs, 2025, [arXiv:cs.AI/2512.01797].
- Pach, M.; Karthik, S.; Bouniot, Q.; Belongie, S.; Akata, Z. Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- Stoehr, N.; Du, K.; Snæbjarnarson, V.; West, R.; Cotterell, R.; Schein, A. Activation Scaling for Steering and Interpreting Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2024; Al-Onaizan, Y.; Bansal, M.; Chen, Y.N., Eds., Miami, Florida, USA, 2024; pp. 8189–8200. [CrossRef]
- Yao, Y.; Liu, S.; Liu, Z.; Li, Q.; Liu, M.; Han, X.; Guo, Z.; Wu, H.; Song, L. Activation-Guided Consensus Merging for Large Language Models. arXiv preprint arXiv:2505.14009 2025.
- Wang, M.; Chen, X.; Wang, Y.; He, Z.; Xu, J.; Liang, T.; Liu, Q.; Yao, Y.; Wang, W.; Ma, R.; et al. Two Experts Are All You Need for Steering Thinking: Reinforcing Cognitive Effort in MoE Reasoning Models Without Additional Training. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- Meng, K.; Sharma, A.S.; Andonian, A.J.; Belinkov, Y.; Bau, D. Mass-Editing Memory in a Transformer. In Proceedings of the Eleventh International Conference on Learning Representations, 2023.
- Zhong, M.; An, C.; Chen, W.; Han, J.; He, P. Seeking Neural Nuggets: Knowledge Transfer in Large Language Models from a Parametric Perspective. In Proceedings of the Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
- Zhang, Z.; Liu, B.; Shao, J. Fine-tuning Happens in Tiny Subspaces: Exploring Intrinsic Task-specific Subspaces of Pre-trained Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Rogers, A.; Boyd-Graber, J.; Okazaki, N., Eds., Toronto, Canada, 2023; pp. 1701–1713. [CrossRef]
- Xu, H.; Zhan, R.; Ma, Y.; Wong, D.F.; Chao, L.S. Let’s Focus on Neuron: Neuron-Level Supervised Fine-tuning for Large Language Model. In Proceedings of the 31st International Conference on Computational Linguistics; Rambow, O.; Wanner, L.; Apidianaki, M.; Al-Khalifa, H.; Eugenio, B.D.; Schockaert, S., Eds., Abu Dhabi, UAE, 2025; pp. 9393–9406.
- Xi, Z.; Zheng, R.; Gui, T.; Zhang, Q.; Huang, X. Efficient Adversarial Training with Robust Early-Bird Tickets. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022; Goldberg, Y.; Kozareva, Z.; Zhang, Y., Eds. Association for Computational Linguistics, 2022, pp. 8318–8331. [CrossRef]
- Zhou, Y.; Chen, W.; Zheng, R.; Xi, Z.; Gui, T.; Zhang, Q.; Huang, X. ORTicket: Let One Robust BERT Ticket Transfer across Different Tasks. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC/COLING 2024, 20-25 May, 2024, Torino, Italy; Calzolari, N.; Kan, M.; Hoste, V.; Lenci, A.; Sakti, S.; Xue, N., Eds. ELRA and ICCL, 2024, pp. 12527–12538.
- Zhang, X.; Liang, Y.; Meng, F.; Zhang, S.; Chen, Y.; Xu, J.; Zhou, J. Multilingual Knowledge Editing with Language-Agnostic Factual Neurons. In Proceedings of the 31st International Conference on Computational Linguistics, COLING 2025, Abu Dhabi, UAE, January 19-24, 2025; Rambow, O.; Wanner, L.; Apidianaki, M.; Al-Khalifa, H.; Eugenio, B.D.; Schockaert, S., Eds. Association for Computational Linguistics, 2025, pp. 5775–5788.
- Li, S.; Yao, L.; Zhang, L.; Li, Y. Safety Layers in Aligned Large Language Models: The Key to LLM Security. In Proceedings of the Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025.
- Li, X.; Li, Z.; Kosuga, Y.; Yoshida, Y.; Bian, V. Precision Knowledge Editing: Enhancing Safety in Large Language Models. CoRR 2024, abs/2410.03772, [2410.03772]. [CrossRef]
- Zhang, W.; Wan, C.; Zhang, Y.; Cheung, Y.M.; Tian, X.; Shen, X.; Ye, J. Interpreting and Improving Large Language Models in Arithmetic Calculation. In Proceedings of the Forty-first International Conference on Machine Learning, 2024.
- Zhu, S.; Pan, L.; Li, B.; Xiong, D. LANDeRMT: Detecting and Routing Language-Aware Neurons for Selectively Finetuning LLMs to Machine Translation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Ku, L.W.; Martins, A.; Srikumar, V., Eds., Bangkok, Thailand, 2024; pp. 12135–12148. [CrossRef]
- Tang, H.; Lin, Y.; Lin, J.; Han, Q.; Ke, D.; Hong, S.; Yao, Y.; Wang, G. RazorAttention: Efficient KV Cache Compression Through Retrieval Heads. In Proceedings of the Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025.
- Hooper, C.; Kim, S.; Mohammadzadeh, H.; Mahoney, M.W.; Shao, Y.S.; Keutzer, K.; Gholami, A. KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. In Advances in Neural Information Processing Systems; Globerson, A.; Mackey, L.; Belgrave, D.; Fan, A.; Paquet, U.; Tomczak, J.; Zhang, C., Eds. Curran Associates, Inc., 2024, Vol. 37, pp. 1270–1303. [CrossRef]
- Bondarenko, Y.; Nagel, M.; Blankevoort, T. Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing. In Advances in Neural Information Processing Systems; Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; Levine, S., Eds. Curran Associates, Inc., 2023, Vol. 36, pp. 75067–75096.
- Chen, Y.; Shang, J.; Zhang, Z.; Xie, Y.; Sheng, J.; Liu, T.; Wang, S.; Sun, Y.; Wu, H.; Wang, H. Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Che, W.; Nabende, J.; Shutova, E.; Pilehvar, M.T., Eds., Vienna, Austria, 2025; pp. 28241–28259. [CrossRef]
- Xia, H.; Leong, C.T.; Wang, W.; Li, Y.; Li, W. TokenSkip: Controllable Chain-of-Thought Compression in LLMs. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing; Christodoulopoulos, C.; Chakraborty, T.; Rose, C.; Peng, V., Eds., Suzhou, China, 2025; pp. 3351–3363. [CrossRef]
- Tan, Y.; Wang, M.; He, S.; Liao, H.; Zhao, C.; Lu, Q.; Liang, T.; Zhao, J.; Liu, K. Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies, 2025, [arXiv:cs.LG/2512.19673].
- Lee, A.; Bai, X.; Pres, I.; Wattenberg, M.; Kummerfeld, J.K.; Mihalcea, R. A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity. In Proceedings of the Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024.
- Rimsky, N.; Gabrieli, N.; Schulz, J.; Tong, M.; Hubinger, E.; Turner, A. Steering Llama 2 via Contrastive Activation Addition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Ku, L.W.; Martins, A.; Srikumar, V., Eds., Bangkok, Thailand, 2024; pp. 15504–15522. [CrossRef]
- van der Weij, T.; Poesio, M.; Schoots, N. Extending activation steering to broad skills and multiple behaviours. arXiv preprint arXiv:2403.05767 2024.
- Lu, D.; Rimsky, N. Investigating bias representations in Llama 2 Chat via activation steering. arXiv preprint arXiv:2402.00402 2024.
- Postmus, J.; Abreu, S. Steering large language models using conceptors: Improving addition-based activation engineering. arXiv preprint arXiv:2410.16314 2024.
- Turner, A.M.; Thiergart, L.; Leech, G.; Udell, D.; Vazquez, J.J.; Mini, U.; MacDiarmid, M. Steering Language Models With Activation Engineering, 2024, [arXiv:cs.CL/2308.10248].
- Sharma, V.; Raman, V. Steering Conceptual Bias via Transformer Latent-Subspace Activation, 2025, [arXiv:cs.AI/2506.18887].
- Wang, M.; Xu, Z.; Mao, S.; Deng, S.; Tu, Z.; Chen, H.; Zhang, N. Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms. arXiv preprint arXiv:2505.20322 2025.
- Bayat, R.; Rahimi-Kalahroudi, A.; Pezeshki, M.; Chandar, S.; Vincent, P. Steering large language model activations in sparse spaces. arXiv preprint arXiv:2503.00177 2025.
- Weng, J.; Zheng, H.; Zhang, H.; He, Q.; Tao, J.; Xue, H.; Chu, Z.; Wang, X. Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework. arXiv preprint arXiv:2509.18127 2025.
- He, Z.; Zhao, H.; Qiao, Y.; Yang, F.; Payani, A.; Ma, J.; Du, M. Saif: A sparse autoencoder framework for interpreting and steering instruction following of language models. arXiv preprint arXiv:2502.11356 2025.
- Soo, S.; Guang, C.; Teng, W.; Balaganesh, C.; Guoxian, T.; Ming, Y. Interpretable Steering of Large Language Models with Feature Guided Activation Additions. arXiv preprint arXiv:2501.09929 2025.
- Ilharco, G.; Ribeiro, M.T.; Wortsman, M.; Schmidt, L.; Hajishirzi, H.; Farhadi, A. Editing models with task arithmetic. In Proceedings of the Eleventh International Conference on Learning Representations, 2023.
- Yadav, P.; Tam, D.; Choshen, L.; Raffel, C.A.; Bansal, M. Ties-merging: Resolving interference when merging models. Advances in Neural Information Processing Systems 2023, 36, 7093–7115.
- Chen, J.; Wang, X.; Yao, Z.; Bai, Y.; Hou, L.; Li, J. Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons, 2025.
- Suau, X.; Delobelle, P.; Metcalf, K.; Joulin, A.; Apostoloff, N.; Zappella, L.; Rodriguez, P. Whispering Experts: Neural Interventions for Toxicity Mitigation in Language Models. In Proceedings of the Forty-first International Conference on Machine Learning, 2024.
- Templeton, A.; Conerly, T.; Marcus, J.; Lindsey, J.; Bricken, T.; Chen, B.; Pearce, A.; Citro, C.; Ameisen, E.; Jones, A.; et al. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread 2024.
- Zhao, Y.; Zhang, W.; Xie, Y.; Goyal, A.; Kawaguchi, K.; Shieh, M. Understanding and Enhancing Safety Mechanisms of LLMs via Safety-Specific Neuron. In Proceedings of the Thirteenth International Conference on Learning Representations, 2025.
- Arditi, A.; Obeso, O.B.; Syed, A.; Paleka, D.; Rimsky, N.; Gurnee, W.; Nanda, N. Refusal in Language Models Is Mediated by a Single Direction. In Proceedings of the Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
- Zhao, J.; Huang, J.; Wu, Z.; Bau, D.; Shi, W. LLMs Encode Harmfulness and Refusal Separately. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- Yin, Q.; Leong, C.T.; Yang, L.; Huang, W.; Li, W.; Wang, X.; Yoon, J.; YunXing; XingYu; Gu, J. Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?, 2025, [arXiv:cs.AI/2510.06036].
- Ball, S.; Kreuter, F.; Panickssery, N. Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models, 2024, [arXiv:cs.CL/2406.09289].
- Wang, X.; Hu, C.; Röttger, P.; Plank, B. Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation. In Proceedings of the Thirteenth International Conference on Learning Representations, 2025.
- Wang, X.; Wang, M.; Liu, Y.; Schuetze, H.; Plank, B. Refusal Direction is Universal Across Safety-Aligned Languages. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- Huang, J.; Tao, J.; Icard, T.; Yang, D.; Potts, C. Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors, 2025, [arXiv:cs.LG/2505.11770].
- Chen, S.; Xiong, M.; Liu, J.; Wu, Z.; Xiao, T.; Gao, S.; He, J. In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation. In Proceedings of the Forty-first International Conference on Machine Learning, 2024.
- Stolfo, A.; Balachandran, V.; Yousefi, S.; Horvitz, E.; Nushi, B. Improving Instruction-Following in Language Models through Activation Steering. In Proceedings of the Thirteenth International Conference on Learning Representations, 2025.
- Jiang, G.; Li, Z.; Lian, D.; Wei, Y. Refine Large Language Model Fine-tuning via Instruction Vector. arXiv preprint arXiv:2406.12227 2024.
- Li, J.; Ye, H.; Chen, Y.; Li, X.; Zhang, L.; Alinejad-Rokny, H.; Peng, J.C.H.; Yang, M. Training Superior Sparse Autoencoders for Instruct Models. arXiv preprint arXiv:2506.07691 2025.
- Cai, Y.; Cao, D.; Guo, R.; Wen, Y.; Liu, G.; Chen, E. Locating and Mitigating Gender Bias in Large Language Models, 2024, [arXiv:cs/2403.14409]. [CrossRef]
- Guan, X.; Lin, P.; Wu, Z.; Wang, Z.; Zhang, R.; Kazim, E.; Koshiyama, A. MPF: Aligning and Debiasing Language Models Post Deployment via Multi Perspective Fusion, 2025, [arXiv:cs/2507.02595]. [CrossRef]
- Chintam, A.; Beloch, R.; Zuidema, W.; Hanna, M.; Van Der Wal, O. Identifying and Adapting Transformer-Components Responsible for Gender Bias in an English Language Model. In Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, Singapore, 2023; pp. 379–394. [CrossRef]
- Potertì, D.; Seveso, A.; Mercorio, F. Can Role Vectors Affect LLM Behaviour? In Findings of the Association for Computational Linguistics: EMNLP 2025; Christodoulopoulos, C.; Chakraborty, T.; Rose, C.; Peng, V., Eds., Suzhou, China, 2025; pp. 17735–17747. [CrossRef]
- Chen, R.; Arditi, A.; Sleight, H.; Evans, O.; Lindsey, J. Persona vectors: Monitoring and controlling character traits in language models. arXiv preprint arXiv:2507.21509 2025.
- Handa, G.; Wu, Z.; Koshiyama, A.; Treleaven, P.C. Personality as a Probe for LLM Evaluation: Method Trade-offs and Downstream Effects. In Proceedings of the NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling, 2025.
- Su, Y.; Zhang, J.; Yang, S.; Wang, X.; Hu, L.; Wang, D. Understanding How Value Neurons Shape the Generation of Specified Values in LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2025; Christodoulopoulos, C.; Chakraborty, T.; Rose, C.; Peng, V., Eds., Suzhou, China, 2025; pp. 9433–9452. [CrossRef]
- Deng, J.; Tang, T.; Yin, Y.; Yang, W.; Zhao, X.; Wen, J.R. Neuron-based Personality Trait Induction in Large Language Models. In Proceedings of the Thirteenth International Conference on Learning Representations, 2025.
- Chen, W.; Huang, Z.; Xie, L.; Lin, B.; Li, H.; Lu, L.; Tian, X.; Cai, D.; Zhang, Y.; Wang, W.; et al. From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning. In Proceedings of the Forty-first International Conference on Machine Learning, 2024.
- Tak, A.N.; Banayeeanzade, A.; Bolourani, A.; Kian, M.; Jia, R.; Gratch, J. Mechanistic Interpretability of Emotion Inference in Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2025; Che, W.; Nabende, J.; Shutova, E.; Pilehvar, M.T., Eds., Vienna, Austria, 2025; pp. 13090–13120. [CrossRef]
- Yuan, S.; Qu, Z.; Tawfelis, M.; Färber, M. From Monolingual to Bilingual: Investigating Language Conditioning in Large Language Models for Psycholinguistic Tasks. arXiv preprint arXiv:2508.02502 2025.
- Ju, T.; Shao, Z.; Wang, B.; Chen, Y.; Zhang, Z.; Fei, H.; Lee, M.L.; Hsu, W.; Duan, S.; Liu, G. Probing then Editing Response Personality of Large Language Models. In Proceedings of the Second Conference on Language Modeling, 2025.
- Karny, S.; Baez, A.; Pataranutaporn, P. Neural Transparency: Mechanistic Interpretability Interfaces for Anticipating Model Behaviors for Personalized AI. arXiv preprint arXiv:2511.00230 2025.
- Banayeeanzade, A.; Tak, A.N.; Bahrani, F.; Bolourani, A.; Blas, L.; Ferrara, E.; Gratch, J.; Karimireddy, S.P. Psychological Steering in LLMs: An Evaluation of Effectiveness and Trustworthiness. arXiv preprint arXiv:2510.04484 2025.
- Zhao, Y.; Zhang, W.; Chen, G.; Kawaguchi, K.; Bing, L. How do Large Language Models Handle Multilingualism? In Advances in Neural Information Processing Systems; Globerson, A.; Mackey, L.; Belgrave, D.; Fan, A.; Paquet, U.; Tomczak, J.; Zhang, C., Eds. Curran Associates, Inc., 2024, Vol. 37, pp. 15296–15319. [CrossRef]
- Liu, Y.; Chen, R.; Hirlimann, L.; Hakimi, A.D.; Wang, M.; Kargaran, A.H.; Rothe, S.; Yvon, F.; Schuetze, H. On Relation-Specific Neurons in Large Language Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing; Christodoulopoulos, C.; Chakraborty, T.; Rose, C.; Peng, V., Eds., Suzhou, China, 2025; pp. 992–1022. [CrossRef]
- Jing, Y.; Yao, Z.; Guo, H.; Ran, L.; Wang, X.; Hou, L.; Li, J. LinguaLens: Towards Interpreting Linguistic Mechanisms of Large Language Models via Sparse Auto-Encoder. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing; Christodoulopoulos, C.; Chakraborty, T.; Rose, C.; Peng, V., Eds., Suzhou, China, 2025; pp. 28220–28239. [CrossRef]
- Brinkmann, J.; Wendler, C.; Bartelt, C.; Mueller, A. Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers); Chiruzzo, L.; Ritter, A.; Wang, L., Eds., Albuquerque, New Mexico, 2025; pp. 6131–6150. [CrossRef]
- Philippy, F.; Guo, S.; Haddadan, S. Identifying the Correlation Between Language Distance and Cross-Lingual Transfer in a Multilingual Representation Space. In Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP; Beinborn, L.; Goswami, K.; Muradoğlu, S.; Sorokin, A.; Kumar, R.; Shcherbakov, A.; Ponti, E.M.; Cotterell, R.; Vylomova, E., Eds., Dubrovnik, Croatia, 2023; pp. 22–29. [CrossRef]
- Mousi, B.; Durrani, N.; Dalvi, F.; Hawasly, M.; Abdelali, A. Exploring Alignment in Shared Cross-lingual Spaces. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Ku, L.W.; Martins, A.; Srikumar, V., Eds., Bangkok, Thailand, 2024; pp. 6326–6348. [CrossRef]
- Chi, Z.; Huang, H.; Mao, X.L. Can Cross-Lingual Transferability of Multilingual Transformers Be Activated Without End-Task Data? In Findings of the Association for Computational Linguistics: ACL 2023; Rogers, A.; Boyd-Graber, J.; Okazaki, N., Eds., Toronto, Canada, 2023; pp. 12572–12584. [CrossRef]
- Hinck, M.; Holtermann, C.; Olson, M.L.; Schneider, F.; Yu, S.; Bhiwandiwalla, A.; Lauscher, A.; Tseng, S.Y.; Lal, V. Why do LLaVA Vision-Language Models Reply to Images in English? In Findings of the Association for Computational Linguistics: EMNLP 2024; Al-Onaizan, Y.; Bansal, M.; Chen, Y.N., Eds., Miami, Florida, USA, 2024; pp. 13402–13421. [CrossRef]
- Zhang, H.; Shang, C.; Wang, S.; Zhang, D.; Yu, Y.; Yao, F.; Sun, R.; Yang, Y.; Wei, F. ShifCon: Enhancing Non-Dominant Language Capabilities with a Shift-based Multilingual Contrastive Framework. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Che, W.; Nabende, J.; Shutova, E.; Pilehvar, M.T., Eds., Vienna, Austria, 2025; pp. 4818–4841. [CrossRef]
- Wang, M.; Adel, H.; Lange, L.; Liu, Y.; Nie, E.; Strötgen, J.; Schuetze, H. Lost in Multilinguality: Dissecting Cross-lingual Factual Inconsistency in Transformer Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Che, W.; Nabende, J.; Shutova, E.; Pilehvar, M.T., Eds., Vienna, Austria, 2025; pp. 5075–5094. [CrossRef]
- Wang, M.; Lange, L.; Adel, H.; Ma, Y.; Strötgen, J.; Schuetze, H. Language Mixing in Reasoning Language Models: Patterns, Impact, and Internal Causes. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing; Christodoulopoulos, C.; Chakraborty, T.; Rose, C.; Peng, V., Eds., Suzhou, China, 2025; pp. 2637–2665. [CrossRef]
- Liu, Y.; Wang, M.; Kargaran, A.H.; Körner, F.; Nie, E.; Plank, B.; Yvon, F.; Schuetze, H. Tracing Multilingual Factual Knowledge Acquisition in Pretraining. In Findings of the Association for Computational Linguistics: EMNLP 2025; Christodoulopoulos, C.; Chakraborty, T.; Rose, C.; Peng, V., Eds., Suzhou, China, 2025; pp. 2121–2146. [CrossRef]
- Chen, Y.; Cao, P.; Chen, Y.; Liu, K.; Zhao, J. Knowledge Localization: Mission Not Accomplished? Enter Query Localization! In Proceedings of the Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025.
- Katz, S.; Belinkov, Y.; Geva, M.; Wolf, L. Backward Lens: Projecting Language Model Gradients into the Vocabulary Space. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024; Al-Onaizan, Y.; Bansal, M.; Chen, Y., Eds. Association for Computational Linguistics, 2024, pp. 2390–2422. [CrossRef]
- Lai, W.; Fraser, A.; Titov, I. Joint Localization and Activation Editing for Low-Resource Fine-Tuning. In Proceedings of the Forty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025. OpenReview.net, 2025.
- Muhamed, A.; Bonato, J.; Diab, M.T.; Smith, V. SAEs Can Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs. CoRR 2025, abs/2504.08192, [2504.08192]. [CrossRef]
- Jin, Z.; Cao, P.; Yuan, H.; Chen, Y.; Xu, J.; Li, H.; Jiang, X.; Liu, K.; Zhao, J. Cutting Off the Head Ends the Conflict: A Mechanism for Interpreting and Mitigating Knowledge Conflicts in Language Models. In Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024; Ku, L.; Martins, A.; Srikumar, V., Eds. Association for Computational Linguistics, 2024, pp. 1193–1215. [CrossRef]
- Li, G.; Chen, Y.; Tong, H. Taming Knowledge Conflicts in Language Models. In Proceedings of the Forty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025. OpenReview.net, 2025.
- Wu, Z.; Arora, A.; Wang, Z.; Geiger, A.; Jurafsky, D.; Manning, C.D.; Potts, C. ReFT: Representation Finetuning for Language Models. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024; Globerson, A.; Mackey, L.; Belgrave, D.; Fan, A.; Paquet, U.; Tomczak, J.M.; Zhang, C., Eds., 2024.
- Chen, A.; Merullo, J.; Stolfo, A.; Pavlick, E. Transferring Linear Features Across Language Models With Model Stitching. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- Yadav, P.; Tam, D.; Choshen, L.; Raffel, C.A.; Bansal, M. TIES-Merging: Resolving Interference When Merging Models. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023; Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; Levine, S., Eds., 2023.
- Quirke, P.; Barez, F. Understanding Addition in Transformers. In Proceedings of the Twelfth International Conference on Learning Representations, 2024.
- Yang, H.; Zhao, Q.; Li, L. Chain-of-Thought in Large Language Models: Decoding, Projection, and Activation, 2024, [arXiv:cs.AI/2412.03944].
- Venhoff, C.; Arcuschin, I.; Torr, P.; Conmy, A.; Nanda, N. Understanding Reasoning in Thinking Language Models via Steering Vectors, 2025, [arXiv:cs.LG/2506.18167].
- Ward, J.; Lin, C.; Venhoff, C.; Nanda, N. Reasoning-Finetuning Repurposes Latent Representations in Base Models. CoRR 2025, abs/2507.12638, [2507.12638]. [CrossRef]
- Troitskii, D.; Pal, K.; Wendler, C.; McDougall, C.S. Internal states before wait modulate reasoning patterns. In Findings of the Association for Computational Linguistics: EMNLP 2025; Christodoulopoulos, C.; Chakraborty, T.; Rose, C.; Peng, V., Eds., Suzhou, China, 2025; pp. 18640–18649. [CrossRef]
- Wang, Z.; Ma, Y.; Xu, C. Eliciting Chain-of-Thought in Base LLMs via Gradient-Based Representation Optimization, 2025, [arXiv:cs.CL/2511.19131].
- Sun, Y.; Stolfo, A.; Sachan, M. Probing for Arithmetic Errors in Language Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China, 2025; pp. 8122–8139. [CrossRef]
- Cywiński, B.; Bussmann, B.; Conmy, A.; Engels, J.; Nanda, N.; Rajamanoharan, S. Can we interpret latent reasoning using current mechanistic interpretability tools?, 2025.
- Højer, B.; Jarvis, O.; Heinrich, S. Improving Reasoning Performance in Large Language Models via Representation Engineering, 2025, [arXiv:cs.LG/2504.19483].
- Tang, X.; Wang, X.; Lv, Z.; Min, Y.; Zhao, W.X.; Hu, B.; Liu, Z.; Zhang, Z. Unlocking General Long Chain-of-Thought Reasoning Capabilities of Large Language Models via Representation Engineering. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Che, W.; Nabende, J.; Shutova, E.; Pilehvar, M.T., Eds., Vienna, Austria, 2025; pp. 6832–6849. [CrossRef]
- Hong, Y.; Cao, M.; Zhou, D.; Yu, L.; Jin, Z. The Reasoning-Memorization Interplay in Language Models Is Mediated by a Single Direction. In Findings of the Association for Computational Linguistics: ACL 2025; Che, W.; Nabende, J.; Shutova, E.; Pilehvar, M.T., Eds., Vienna, Austria, 2025; pp. 21565–21585. [CrossRef]
- Zhang, J.; Viteri, S. Uncovering Latent Chain of Thought Vectors in Language Models, 2025, [arXiv:cs.CL/2409.14026].
- Liu, S.; Chen, T.; Lu, P.; Ye, H.; Chen, Y.; Xing, L.; Zou, J. Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute, 2025, [arXiv:cs.LG/2506.15882].
- Sinii, V.; Gorbatovski, A.; Cherepanov, A.; Shaposhnikov, B.; Balagansky, N.; Gavrilov, D. Steering LLM Reasoning Through Bias-Only Adaptation, 2025, [arXiv:cs.LG/2505.18706].
- Li, Z.; Wang, X.; Yang, Y.; Yao, Z.; Xiong, H.; Du, M. Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing; Christodoulopoulos, C.; Chakraborty, T.; Rose, C.; Peng, V., Eds., Suzhou, China, 2025; pp. 10893–10913. [CrossRef]
- Song, W.; Li, Z.; Zhang, L.; Zhao, H.; Du, B. Sparse is enough in fine-tuning pre-trained large language models. In Proceedings of the 41st International Conference on Machine Learning. JMLR.org, 2024, ICML’24.
- Mondal, S.K.; Sen, S.; Singhania, A.; Jyothi, P. Language-Specific Neurons Do Not Facilitate Cross-Lingual Transfer. In Proceedings of the Sixth Workshop on Insights from Negative Results in NLP; Drozd, A.; Sedoc, J.; Tafreshi, S.; Akula, A.; Shu, R., Eds., Albuquerque, New Mexico, 2025; pp. 46–62. [CrossRef]
- Gurgurov, D.; van Genabith, J.; Ostermann, S. Sparse Subnetwork Enhancement for Underrepresented Languages in Large Language Models, 2025, [arXiv:cs.CL/2510.13580].
- Li, Y.; Gao, W.; Yuan, C.; Wang, X. Fine-Tuning is Subgraph Search: A New Lens on Learning Dynamics, 2025, [arXiv:cs.LG/2502.06106].
- Hoogland, J.; Wang, G.; Farrugia-Roberts, M.; Carroll, L.; Wei, S.; Murfet, D. The developmental landscape of in-context learning, 2024. URL https://arxiv.org/abs/2402.02364.
- Minegishi, G.; Furuta, H.; Taniguchi, S.; Iwasawa, Y.; Matsuo, Y. In-Context Meta Learning Induces Multi-Phase Circuit Emergence. In Proceedings of the ICLR 2025 Workshop on Building Trust in Language Models and Applications, 2025.
- Nanda, N.; Chan, L.; Lieberum, T.; Smith, J.; Steinhardt, J. Progress measures for grokking via mechanistic interpretability. In Proceedings of the The Eleventh International Conference on Learning Representations, 2023.
- Liu, Z.; Michaud, E.J.; Tegmark, M. Omnigrok: Grokking Beyond Algorithmic Data. In Proceedings of the The Eleventh International Conference on Learning Representations, 2023.
- Furuta, H.; Minegishi, G.; Iwasawa, Y.; Matsuo, Y. Towards Empirical Interpretation of Internal Circuits and Properties in Grokked Transformers on Modular Polynomials. Transactions on Machine Learning Research 2024.
- Qiye, H.; Hao, Z.; RuoXi, Y. Exploring Grokking: Experimental and Mechanistic Investigations. arXiv preprint arXiv:2412.10898 2024.
- Li, Z.; Fan, C.; Zhou, T. Where to find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test. arXiv preprint arXiv:2506.21551 2025.
- Lei, L.; Gu, J.; Ma, X.; Tang, C.; Chen, J.; Xu, T. Generic Token Compression in Multimodal Large Language Models from an Explainability Perspective, 2025, [arXiv:cs.CV/2506.01097].
- Guo, Z.; Kamigaito, H.; Watanabe, T. Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters. In Proceedings of the Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing; Al-Onaizan, Y.; Bansal, M.; Chen, Y.N., Eds., Miami, Florida, USA, 2024; pp. 21158–21166. [CrossRef]
- Ye, W.; Wu, Q.; Lin, W.; Zhou, Y. Fit and prune: Fast and training-free visual token pruning for multi-modal large language models. In Proceedings of the Proceedings of the AAAI Conference on Artificial Intelligence, 2025, Vol. 39, pp. 22128–22136.
- Xiao, G.; Tang, J.; Zuo, J.; Guo, J.; Yang, S.; Tang, H.; Fu, Y.; Han, S. Duoattention: Efficient long-context llm inference with retrieval and streaming heads. arXiv preprint arXiv:2410.10819 2024.
- Laitenberger, F.; Kopiczko, D.; Snoek, C.G.M.; Asano, Y.M. What Layers When: Learning to Skip Compute in LLMs with Residual Gates, 2025, [arXiv:cs.CL/2510.13876].
- Valade, F. Accelerating Large Language Model Inference with Self-Supervised Early Exits, 2024, [arXiv:cs.CL/2407.21082].
- Wang, H.; Wang, Y.; Liu, T.; Zhao, T.; Gao, J. HadSkip: Homotopic and Adaptive Layer Skipping of Pre-trained Language Models for Efficient Inference. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023; Bouamor, H.; Pino, J.; Bali, K., Eds., Singapore, 2023; pp. 4283–4294. [CrossRef]
- Shelke, A.; Savant, R.; Joshi, R. Towards Building Efficient Sentence BERT Models using Layer Pruning. In Proceedings of the Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation, 2024, pp. 720–725.
- Lu, X.; Liu, Q.; Xu, Y.; Zhou, A.; Huang, S.; Zhang, B.; Yan, J.; Li, H. Not all experts are equal: Efficient expert pruning and skipping for mixture-of-experts large language models. arXiv preprint arXiv:2402.14800 2024.
- Tan, S.; Wu, D.; Monz, C. Neuron Specialization: Leveraging Intrinsic Task Modularity for Multilingual Machine Translation. In Proceedings of the Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing; Al-Onaizan, Y.; Bansal, M.; Chen, Y.N., Eds., Miami, Florida, USA, 2024; pp. 6506–6527. [CrossRef]
- Xiao, H.; Yang, Q.; Xie, D.; Xu, W.; Zhou, W.; Liu, H.; Liu, Z.; Wong, N. Exploring Layer-wise Information Effectiveness for Post-Training Quantization in Small Language Models. arXiv preprint arXiv:2508.03332 2025.
- Ranjan, N.; Savakis, A. Mix-QViT: Mixed-precision vision transformer quantization driven by layer importance and quantization sensitivity. arXiv preprint arXiv:2501.06357 2025.
- Zeng, B.; Ji, B.; Liu, X.; Yu, J.; Li, S.; Ma, J.; Li, X.; Wang, S.; Hong, X.; Tang, Y. Lsaq: Layer-specific adaptive quantization for large language model deployment. arXiv preprint arXiv:2412.18135 2024.
- Li, Z.Z.; Zhang, D.; Zhang, M.L.; Zhang, J.; Liu, Z.; Yao, Y.; Xu, H.; Zheng, J.; Wang, P.J.; Chen, X.; et al. From system 1 to system 2: A survey of reasoning large language models. arXiv preprint arXiv:2502.17419 2025.
- Ren, Z.; Shao, Z.; Song, J.; Xin, H.; Wang, H.; Zhao, W.; Zhang, L.; Fu, Z.; Zhu, Q.; Yang, D.; et al. Deepseek-prover-v2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition. arXiv preprint arXiv:2504.21801 2025.
- Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The llama 3 herd of models. arXiv e-prints 2024, pp. arXiv–2407.
- OpenAI.; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report, 2024, [arXiv:cs.CL/2303.08774].
- Liang, X.; Li, Z.Z.; Gong, Y.; Wang, Y.; Zhang, H.; Shen, Y.; Wu, Y.N.; Chen, W. SwS: Self-aware Weakness-driven Problem Synthesis in Reinforcement Learning for LLM Reasoning. arXiv preprint arXiv:2506.08989 2025.
- Qin, L.; Chen, Q.; Zhou, Y.; Chen, Z.; Li, Y.; Liao, L.; Li, M.; Che, W.; Yu, P.S. A survey of multilingual large language models. Patterns 2025, 6.
- Yang, H.; Chen, H.; Guo, H.; Chen, Y.; Lin, C.S.; Hu, S.; Hu, J.; Wu, X.; Wang, X. LLM-MedQA: Enhancing Medical Question Answering through Case Studies in Large Language Models. arXiv preprint arXiv:2501.05464 2024.
- Chang, Y.; Li, Z.; Zhang, H.; Kong, Y.; Wu, Y.; Guo, Z.; Wong, N. TreeReview: A Dynamic Tree of Questions Framework for Deep and Efficient LLM-based Scientific Peer Review. arXiv preprint arXiv:2506.07642 2025.
- Li, S.; Yang, C.; Wu, T.; Shi, C.; Zhang, Y.; Zhu, X.; Cheng, Z.; Cai, D.; Yu, M.; Liu, L.; et al. A Survey on the Honesty of Large Language Models. arXiv preprint arXiv:2409.18786 2024.
- Zhao, H.; Liu, Z.; Wu, Z.; Li, Y.; Yang, T.; Shu, P.; Xu, S.; Dai, H.; Zhao, L.; Mai, G.; et al. Revolutionizing finance with llms: An overview of applications and insights. arXiv preprint arXiv:2401.11641 2024.
- Yu, Y.; Zhang, Y.; Zhang, D.; Liang, X.; Zhang, H.; Zhang, X.; Khademi, M.; Awadalla, H.H.; Wang, J.; Yang, Y.; et al. Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective. In Proceedings of the Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Che, W.; Nabende, J.; Shutova, E.; Pilehvar, M.T., Eds., Vienna, Austria, 2025; pp. 24914–24937. [CrossRef]
- Qwen Team; Yang, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Li, C.; Liu, D.; et al. Qwen2.5 Technical Report, 2025, [arXiv:cs.CL/2412.15115].
- Huang, B.; Wu, X.; Zhou, Y.; Wu, J.; Feng, L.; Cheng, R.; Tan, K.C. Exploring the true potential: Evaluating the black-box optimization capability of large language models. arXiv preprint arXiv:2404.06290 2024.
- Hong, J.; Tu, Q.; Chen, C.; Xing, G.; Zhang, J.; Yan, R. Cyclealign: Iterative distillation from black-box llm to white-box models for better human alignment. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 14596–14609.
- Zhang, J.; Ding, M.; Liu, Y.; Hong, J.; Tramèr, F. Black-box Optimization of LLM Outputs by Asking for Directions. arXiv preprint arXiv:2510.16794 2025.
- Ferrando, J.; Sarti, G.; Bisazza, A.; Costa-Jussà, M.R. A primer on the inner workings of transformer-based language models. arXiv preprint arXiv:2405.00208 2024.
- Zhao, H.; Chen, H.; Yang, F.; Liu, N.; Deng, H.; Cai, H.; Wang, S.; Yin, D.; Du, M. Explainability for large language models: A survey. ACM Transactions on Intelligent Systems and Technology 2024, 15, 1–38.
- Räuker, T.; Ho, A.; Casper, S.; Hadfield-Menell, D. Toward transparent ai: A survey on interpreting the inner structures of deep neural networks. In Proceedings of the 2023 ieee conference on secure and trustworthy machine learning (satml). IEEE, 2023, pp. 464–483.
- Allen-Zhu, Z.; Li, Y. Physics of language models: Part 1, learning hierarchical language structures. arXiv preprint arXiv:2305.13673 2023.
- Zheng, Z.; Wang, Y.; Huang, Y.; Song, S.; Yang, M.; Tang, B.; Xiong, F.; Li, Z. Attention heads of large language models: A survey. arXiv preprint arXiv:2409.03752 2024.
- Saphra, N.; Wiegreffe, S. Mechanistic? arXiv preprint arXiv:2410.09087 2024.
- López-Otal, M.; Gracia, J.; Bernad, J.; Bobed, C.; Pitarch-Ballesteros, L.; Anglés-Herrero, E. Linguistic Interpretability of Transformer-based Language Models: a systematic review. arXiv preprint arXiv:2504.08001 2025.
- Gantla, S.R. Exploring Mechanistic Interpretability in Large Language Models: Challenges, Approaches, and Insights. In Proceedings of the 2025 International Conference on Data Science, Agents & Artificial Intelligence (ICDSAAI). IEEE, 2025, pp. 1–8.
- Luo, H.; Specia, L. From understanding to utilization: A survey on explainability for large language models. arXiv preprint arXiv:2401.12874 2024.
- Wu, X.; Zhao, H.; Zhu, Y.; Shi, Y.; Yang, F.; Hu, L.; Liu, T.; Zhai, X.; Yao, W.; Li, J.; et al. Usable XAI: 10 strategies towards exploiting explainability in the LLM era. arXiv preprint arXiv:2403.08946 2024.
- Rai, D.; Zhou, Y.; Feng, S.; Saparov, A.; Yao, Z. A practical review of mechanistic interpretability for transformer-based language models. arXiv preprint arXiv:2407.02646 2024.
- Bereska, L.; Gavves, E. Mechanistic interpretability for AI safety–a review. arXiv preprint arXiv:2404.14082 2024.
- Lee, S.; Cho, A.; Kim, G.C.; Peng, S.; Phute, M.; Chau, D.H. Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety. In Proceedings of the Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing; Christodoulopoulos, C.; Chakraborty, T.; Rose, C.; Peng, V., Eds., Suzhou, China, 2025; pp. 21514–21545. [CrossRef]
- Resck, L.; Augenstein, I.; Korhonen, A. Explainability and interpretability of multilingual large language models: A survey. In Proceedings of the Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 20465–20497.
- Lin, Z.; Basu, S.; Beigi, M.; Manjunatha, V.; Rossi, R.A.; Wang, Z.; Zhou, Y.; Balasubramanian, S.; Zarei, A.; Rezaei, K.; et al. A survey on mechanistic interpretability for multi-modal foundation models. arXiv preprint arXiv:2502.17516 2025.
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I.; et al. Language models are unsupervised multitask learners. OpenAI blog 2019, 1, 9.
- Su, J.; Ahmed, M.; Lu, Y.; Pan, S.; Bo, W.; Liu, Y. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 2024, 568, 127063.
- Bricken, T.; Pehlevan, C. Attention approximates sparse distributed memory. Advances in Neural Information Processing Systems 2021, 34, 15301–15315.
- Li, R.; Gao, Y. Anchored Answers: Unravelling Positional Bias in GPT-2’s Multiple-Choice Questions, 2025, [arXiv:cs/2405.03205]. [CrossRef]
- Voita, E.; Talbot, D.; Moiseev, F.; Sennrich, R.; Titov, I. Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. In Proceedings of the Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 5797–5808.
- Feng, J.; Steinhardt, J. How do language models bind entities in context? In Proceedings of the NeurIPS 2023 Workshop on Symmetry and Geometry in Neural Representations, 2023.
- Men, T.; Cao, P.; Jin, Z.; Chen, Y.; Liu, K.; Zhao, J. Unlocking the Future: Exploring Look-Ahead Planning Mechanistic Interpretability in Large Language Models. In Proceedings of the Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 7713–7724.
- Geva, M.; Caciularu, A.; Wang, K.; Goldberg, Y. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Proceedings of the Proceedings of the 2022 conference on empirical methods in natural language processing, 2022, pp. 30–45.
- Shazeer, N. GLU Variants Improve Transformer, 2020, [arXiv:cs.LG/2002.05202].
- Elhage, N.; Hume, T.; Olsson, C.; Schiefer, N.; Henighan, T.; Kravec, S.; Hatfield-Dodds, Z.; Lasenby, R.; Drain, D.; Chen, C.; et al. Toy Models of Superposition. CoRR 2022, abs/2209.10652, [2209.10652]. [CrossRef]
- Lieberum, T.; Rajamanoharan, S.; Conmy, A.; Smith, L.; Sonnerat, N.; Varma, V.; Kramar, J.; Dragan, A.; Shah, R.; Nanda, N. Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2. In Proceedings of the Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP; Belinkov, Y.; Kim, N.; Jumelet, J.; Mohebbi, H.; Mueller, A.; Chen, H., Eds., Miami, Florida, US, 2024; pp. 278–300. [CrossRef]
- He, Z.; Shu, W.; Ge, X.; Chen, L.; Wang, J.; Zhou, Y.; Liu, F.; Guo, Q.; Huang, X.; Wu, Z.; et al. Llama scope: Extracting millions of features from llama-3.1-8b with sparse autoencoders. arXiv preprint arXiv:2410.20526 2024.
- Cunningham, H.; Ewart, A.; Riggs, L.; Huben, R.; Sharkey, L. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600 2023.
- Bloom, J. Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2 Small, 2024.
- Ghilardi, D.; Belotti, F.; Molinari, M. Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups. arXiv preprint arXiv:2410.21508 2024.
- Mudide, A.; Engels, J.; Michaud, E.J.; Tegmark, M.; de Witt, C.S. Efficient dictionary learning with switch sparse autoencoders. arXiv preprint arXiv:2410.08201 2024.
- Xu, Z.; Tan, Z.; Wang, S.; Xu, K.; Chen, T. Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder. arXiv preprint arXiv:2511.05745 2025.
- Cho, I.; Hockenmaier, J. Toward Efficient Sparse Autoencoder-Guided Steering for Improved In-Context Learning in Large Language Models. In Proceedings of the Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 28949–28961.
- Gao, L.; la Tour, T.D.; Tillman, H.; Goh, G.; Troll, R.; Radford, A.; Sutskever, I.; Leike, J.; Wu, J. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093 2024.
- Rajamanoharan, S.; Conmy, A.; Smith, L.; Lieberum, T.; Varma, V.; Kramár, J.; Shah, R.; Nanda, N. Improving dictionary learning with gated sparse autoencoders. arXiv preprint arXiv:2404.16014 2024.
- Bussmann, B.; Leask, P.; Nanda, N. Batchtopk sparse autoencoders. arXiv preprint arXiv:2412.06410 2024.
- Rajamanoharan, S.; Lieberum, T.; Sonnerat, N.; Conmy, A.; Varma, V.; Kramár, J.; Nanda, N. Jumping ahead: Improving reconstruction fidelity with jumprelu sparse autoencoders. arXiv preprint arXiv:2407.14435 2024.
- Cho, H.; Yang, H.; Kurkoski, B.M.; Inoue, N. Binary Autoencoder for Mechanistic Interpretability of Large Language Models. arXiv preprint arXiv:2509.20997 2025.
- Kim, J.; Evans, J.; Schein, A. Linear Representations of Political Perspective Emerge in Large Language Models. In Proceedings of the The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025.
- Wang, Z.; Zhang, H.; Li, X.; Huang, K.H.; Han, C.; Ji, S.; Kakade, S.M.; Peng, H.; Ji, H. Eliminating Position Bias of Language Models: A Mechanistic Approach, 2025, [arXiv:cs/2407.01100]. [CrossRef]
- Yu, Y.; Jiang, H.; Luo, X.; Wu, Q.; Lin, C.Y.; Li, D.; Yang, Y.; Huang, Y.; Qiu, L. Mitigate Position Bias in LLMs via Scaling a Single Hidden States Channel. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2025; Che, W.; Nabende, J.; Shutova, E.; Pilehvar, M.T., Eds., Vienna, Austria, 2025; pp. 6092–6111. [CrossRef]
- Dimino, F.; Saxena, K.; Sarmah, B.; Pasquali, S. Tracing Positional Bias in Financial Decision-Making: Mechanistic Insights from Qwen2.5. In Proceedings of the Proceedings of the 6th ACM International Conference on AI in Finance, 2025, pp. 96–104, [arXiv:q-fin/2508.18427]. [CrossRef]
- Pai, T.M.; Wang, J.I.; Lu, L.C.; Sun, S.H.; Lee, H.Y.; Chang, K.W. Billy: Steering large language models via merging persona vectors for creative generation. arXiv preprint arXiv:2510.10157 2025.
- Schwartz, S.H. Universals in the content and structure of values: Theoretical advances and empirical tests in 20 countries. In Advances in experimental social psychology; Elsevier, 1992; Vol. 25, pp. 1–65.
- Joshi, N.; Rando, J.; Saparov, A.; Kim, N.; He, H. Personas as a way to model truthfulness in language models. In Proceedings of the Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 6346–6359.
- Ghandeharioun, A.; Yuan, A.; Guerard, M.; Reif, E.; Lepori, M.A.; Dixon, L. Who’s asking? User personas and the mechanics of latent misalignment. In Proceedings of the Advances in Neural Information Processing Systems; Globerson, A.; Mackey, L.; Belgrave, D.; Fan, A.; Paquet, U.; Tomczak, J.; Zhang, C., Eds. Curran Associates, Inc., 2024, Vol. 37, pp. 125967–126003. [CrossRef]
- Kojima, T.; Okimura, I.; Iwasawa, Y.; Yanaka, H.; Matsuo, Y. On the Multilingual Ability of Decoder-based Pre-trained Language Models: Finding and Controlling Language-Specific Neurons. In Proceedings of the Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers); Duh, K.; Gomez, H.; Bethard, S., Eds., Mexico City, Mexico, 2024; pp. 6919–6971. [CrossRef]
- Li, X.L.; Liang, P. Prefix-Tuning: Optimizing Continuous Prompts for Generation, 2021, [arXiv:cs.CL/2101.00190].
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations, 2022.
- Liu, X.; Ji, K.; Fu, Y.; Tam, W.; Du, Z.; Yang, Z.; Tang, J. P-Tuning: Prompt Tuning Can Be Comparable to Fine-tuning Across Scales and Tasks. In Proceedings of the Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); Muresan, S.; Nakov, P.; Villavicencio, A., Eds., Dublin, Ireland, 2022; pp. 61–68. [CrossRef]
- Varma, V.; Shah, R.; Kenton, Z.; Kramár, J.; Kumar, R. Explaining grokking through circuit efficiency. arXiv preprint arXiv:2309.02390 2023.
- Huang, Y.; Hu, S.; Han, X.; Liu, Z.; Sun, M. Unified View of Grokking, Double Descent and Emergent Abilities: A Comprehensive Study on Algorithm Task. In Proceedings of the First Conference on Language Modeling, 2024.
- Notsawo Jr, P.; Zhou, H.; Pezeshki, M.; Rish, I.; Dumas, G.; et al. Predicting grokking long before it happens: A look into the loss landscape of models which grok. arXiv preprint arXiv:2306.13253 2023.
- Xiong, J.; Shen, J.; Ye, F.; Tao, C.; Wan, Z.; Lu, J.; Wu, X.; Zheng, C.; Guo, Z.; Kong, L.; et al. UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference. arXiv preprint arXiv:2410.03090 2024.
- Xiong, J.; Shen, J.; Zheng, C.; Wan, Z.; Zhao, C.; Yang, C.; Ye, F.; Yang, H.; Kong, L.; Wong, N. ParallelComp: Parallel Long-Context Compressor for Length Extrapolation. arXiv preprint arXiv:2502.14317 2025.
- Kharlapenko, D.; Shabalin, S.; Barez, F.; Conmy, A.; Nanda, N. Scaling sparse feature circuit finding for in-context learning. arXiv preprint arXiv:2504.13756 2025.
- Nikankin, Y.; Arad, D.; Gandelsman, Y.; Belinkov, Y. Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs. arXiv preprint arXiv:2506.09047 2025.
- Duan, X.; Zhou, X.; Xiao, B.; Cai, Z. Unveiling Language Competence Neurons: A Psycholinguistic Approach to Model Interpretability. In Proceedings of the Proceedings of the 31st International Conference on Computational Linguistics; Rambow, O.; Wanner, L.; Apidianaki, M.; Al-Khalifa, H.; Eugenio, B.D.; Schockaert, S., Eds., Abu Dhabi, UAE, 2025; pp. 10148–10157.
- He, Z.; Ge, X.; Tang, Q.; Sun, T.; Cheng, Q.; Qiu, X. Dictionary learning improves patch-free circuit discovery in mechanistic interpretability: A case study on othello-gpt. arXiv preprint arXiv:2402.12201 2024.
- Marks, S.; Rager, C.; Michaud, E.J.; Belinkov, Y.; Bau, D.; Mueller, A. Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models. In Proceedings of the The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025.
- Lindsey, J.; Gurnee, W.; Ameisen, E.; Chen, B.; Pearce, A.; Turner, N.L.; Citro, C.; Abrahams, D.; Carter, S.; Hosmer, B.; et al. On the Biology of a Large Language Model. Transformer Circuits Thread 2025.
- Nguyen, T.; Michaels, J.; Fiterau, M.; Jensen, D. Challenges in Understanding Modality Conflict in Vision-Language Models. arXiv preprint arXiv:2509.02805 2025.
- Prakash, N.; Shaham, T.R.; Haklay, T.; Belinkov, Y.; Bau, D. Fine-tuning enhances existing mechanisms: A case study on entity tracking. arXiv preprint arXiv:2402.14811 2024.
- Nanda, N. Attribution patching: Activation patching at industrial scale, 2023. URL https://www.neelnanda.io/mechanistic-interpretability/attribution-patching.
- Yu, Z.; Ananiadou, S. Neuron-Level Knowledge Attribution in Large Language Models. In Proceedings of the Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024; Al-Onaizan, Y.; Bansal, M.; Chen, Y., Eds. Association for Computational Linguistics, 2024, pp. 3267–3280. [CrossRef]
- Miller, J.; Chughtai, B.; Saunders, W. Transformer circuit faithfulness metrics are not robust. arXiv preprint arXiv:2407.08734 2024.
- Parrack, A.; Attubato, C.L.; Heimersheim, S. Benchmarking deception probes via black-to-white performance boosts. arXiv preprint arXiv:2507.12691 2025.
- Nguyen, J.; Hoang, K.; Attubato, C.L.; Hofstätter, F. Probing and Steering Evaluation Awareness of Language Models, 2025. URL http://arxiv.org/abs/2507.01786.
- Wu, Z.; Arora, A.; Geiger, A.; Wang, Z.; Huang, J.; Jurafsky, D.; Manning, C.D.; Potts, C. AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders. In Proceedings of the Forty-second International Conference on Machine Learning, 2025.
- Karvonen, A.; Rager, C.; Lin, J.; Tigges, C.; Bloom, J.; Chanin, D.; Lau, Y.T.; Farrell, E.; McDougall, C.; Ayonrinde, K.; et al. Saebench: A comprehensive benchmark for sparse autoencoders in language model interpretability. arXiv preprint arXiv:2503.09532 2025.
- Gao, L.; Rajaram, A.; Coxon, J.; Govande, S.V.; Baker, B.; Mossing, D. Weight-sparse transformers have interpretable circuits. arXiv preprint arXiv:2511.13653 2025.
- Yin, F.; Ye, X.; Durrett, G. Lofit: Localized fine-tuning on llm representations. Advances in Neural Information Processing Systems 2024, 37, 9474–9506.
- Jiang, G.; Li, Z.; Jiang, C.; Xue, S.; Zhou, J.; Song, L.; Lian, D.; Wei, Y. Interpretable catastrophic forgetting of large language model fine-tuning via instruction vector. arXiv e-prints 2024, pp. arXiv–2406.
- Zhang, H.; Yang, S.; Liang, X.; Shang, C.; Jiang, Y.; Tao, C.; Xiong, J.; So, H.K.H.; Xie, R.; Chang, A.X.; et al. Find Your Optimal Teacher: Personalized Data Synthesis via Router-Guided Multi-Teacher Distillation. arXiv preprint arXiv:2510.10925 2025.
- Hsueh, C.H.; Huang, P.K.M.; Lin, T.H.; Liao, C.W.; Fang, H.C.; Huang, C.W.; Chen, Y.N. Editing the mind of giants: An in-depth exploration of pitfalls of knowledge editing in large language models. arXiv preprint arXiv:2406.01436 2024.
- Xu, Z.; Wang, S.; Xu, K.; Xu, H.; Wang, M.; Deng, X.; Yao, Y.; Zheng, G.; Chen, H.; Zhang, N. EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models. arXiv preprint arXiv:2504.15133 2025.
- Da Silva, P.Q.; Sethuraman, H.; Rajagopal, D.; Hajishirzi, H.; Kumar, S. Steering off Course: Reliability Challenges in Steering Language Models. arXiv preprint arXiv:2504.04635 2025.
- Braun, J.; Eickhoff, C.; Bahrainian, S.A. Beyond Multiple Choice: Evaluating Steering Vectors for Adaptive Free-Form Summarization. arXiv preprint arXiv:2505.24859 2025.
- Zhang, Z.; Dong, Q.; Zhang, Q.; Zhao, J.; Zhou, E.; Xi, Z.; Jin, S.; Fan, X.; Zhou, Y.; Wu, M.; et al. Why Reinforcement Fine-Tuning Enables MLLMs Preserve Prior Knowledge Better: A Data Perspective, 2025, [arXiv:cs.CL/2506.23508].
- Yao, Y.; Zhang, N.; Xi, Z.; Wang, M.; Xu, Z.; Deng, S.; Chen, H. Knowledge circuits in pretrained transformers. Advances in Neural Information Processing Systems 2024, 37, 118571–118602.
- Xiao, H.; Sung, Y.L.; Stengel-Eskin, E.; Bansal, M. Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression. In Proceedings of the Second Conference on Language Modeling, 2025.
- Xiong, J.; Li, C.; Yang, M.; Hu, X.; Hu, B. Expression Syntax Information Bottleneck for Math Word Problems, 2026, [arXiv:cs.CL/2310.15664].
- Xiong, J.; Han, Q.; Hsieh, Y.; Shen, H.; Xin, H.; Tao, C.; Zhao, C.; Zhang, H.; Wu, T.; Zhang, Z.; et al. MMFormalizer: Multimodal Autoformalization in the Wild. arXiv preprint arXiv:2601.03017 2026.
- Morgan, J.J.B.; Gilliland, A.R. An introduction to psychology; Macmillan, 1927.
- Gruber, O.; Goschke, T. Executive control emerging from dynamic interactions between brain systems mediating language, working memory and attentional processes. Acta psychologica 2004, 115, 105–121.
- Gruszka, A.; Matthews, G. Handbook of individual differences in cognition: Attention, memory, and executive control; Springer, 2010.
- Zhang, J. Cognitive functions of the brain: perception, attention and memory. arXiv preprint arXiv:1907.02863 2019.
- Davies, A.; Khakzar, A. The cognitive revolution in interpretability: From explaining behavior to interpreting representations and algorithms. arXiv preprint arXiv:2408.05859 2024.
- Wulff, D.U.; Mata, R. Advancing cognitive science with llms. arXiv preprint arXiv:2511.00206 2025.
- Ren, Y.; Jin, R.; Zhang, T.; Xiong, D. Do Large Language Models Mirror Cognitive Language Processing? In Proceedings of the Proceedings of the 31st International Conference on Computational Linguistics, 2025, pp. 2988–3001.
- Conklin, H.; Smith, K. Representations as language: An information-theoretic framework for interpretability. arXiv preprint arXiv:2406.02449 2024.
- Kendiukhov, I. A Review of Developmental Interpretability in Large Language Models. arXiv preprint arXiv:2508.15841 2025.
- Ismail, A.A.; Oikarinen, T.; Wang, A.; Adebayo, J.; Stanton, S.; Joren, T.; Kleinhenz, J.; Goodman, A.; Bravo, H.C.; Cho, K.; et al. Concept bottleneck language models for protein design. arXiv preprint arXiv:2411.06090 2024.
- Sun, C.E.; Oikarinen, T.; Ustun, B.; Weng, T.W. Concept Bottleneck Large Language Models. In Proceedings of the The Thirteenth International Conference on Learning Representations, 2024.
- Shang, C.; Zhang, H.; Wen, H.; Yang, Y. Understanding multimodal deep neural networks: A concept selection view. arXiv preprint arXiv:2404.08964 2024.
- Shang, C.; Zhou, S.; Zhang, H.; Ni, X.; Yang, Y.; Wang, Y. Incremental residual concept bottleneck models. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 11030–11040.
- Tan, Z.; Cheng, L.; Wang, S.; Yuan, B.; Li, J.; Liu, H. Interpreting pretrained language models via concept bottlenecks. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2024, pp. 56–74.
- Hu, L.; Ren, C.; Hu, Z.; Lin, H.; Wang, C.L.; Xiong, H.; Zhang, J.; Wang, D. Editable Concept Bottleneck Models, 2025, [arXiv:cs.LG/2405.15476].
- Zhao, D.; Huang, Q.; Yan, D.; Sun, Y.; Yu, J. Partially Shared Concept Bottleneck Models. arXiv preprint arXiv:2511.22170 2025.
- Srivastava, D.; Yan, G.; Weng, L. Vlg-cbm: Training concept bottleneck models with vision-language guidance. Advances in Neural Information Processing Systems 2024, 37, 79057–79094.
- Pan, W.; Liu, Z.; Chen, Q.; Zhou, X.; Haining, Y.; Jia, X. The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Analysis of Orthogonal Safety Directions. In Proceedings of the Forty-second International Conference on Machine Learning, 2025.
- Xie, W.; Feng, Y.; Gu, S.; Yu, D. Importance-based Neuron Allocation for Multilingual Neural Machine Translation. In Proceedings of the Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers); Zong, C.; Xia, F.; Li, W.; Navigli, R., Eds., Online, 2021; pp. 5725–5737. [CrossRef]
- Libovický, J.; Rosa, R.; Fraser, A. On the Language Neutrality of Pre-trained Multilingual Representations. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020; Cohn, T.; He, Y.; Liu, Y., Eds., Online, 2020; pp. 1663–1674. [CrossRef]
- Wu, Z.; Yu, X.V.; Yogatama, D.; Lu, J.; Kim, Y. The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities. In Proceedings of the The Thirteenth International Conference on Learning Representations, 2025.
- Lv, A.; Zhang, K.; Chen, Y.; Wang, Y.; Liu, L.; Wen, J.; Xie, J.; Yan, R. Interpreting Key Mechanisms of Factual Recall in Transformer-Based Language Models. CoRR 2024, abs/2403.19521, [2403.19521]. [CrossRef]
- Muhamed, A.; Smith, V. The Geometry of Forgetting: Analyzing Machine Unlearning through Local Learning Coefficients. In Proceedings of the ICML 2025 Workshop on Reliable and Responsible Foundation Models, 2025.
- Chen, Y.; Cao, P.; Chen, Y.; Liu, K.; Zhao, J. Journey to the Center of the Knowledge Neurons: Discoveries of Language-Independent Knowledge Neurons and Degenerate Knowledge Neurons. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2024, February 20-27, 2024, Vancouver, Canada; Wooldridge, M.J.; Dy, J.G.; Natarajan, S., Eds. AAAI Press, 2024, pp. 17817–17825. [CrossRef]
- Kassem, A.M.; Shi, Z.; Rostamzadeh, N.; Farnadi, G. Reviving Your MNEME: Predicting The Side Effects of LLM Unlearning and Fine-Tuning via Sparse Model Diffing. CoRR 2025, abs/2507.21084, [2507.21084]. [CrossRef]
- Kang, C.; Choi, J. Impact of Co-occurrence on Factual Knowledge of Large Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023; Bouamor, H.; Pino, J.; Bali, K., Eds. Association for Computational Linguistics, 2023, pp. 7721–7735. [CrossRef]
- Jin, M.; Yu, Q.; Huang, J.; Zeng, Q.; Wang, Z.; Hua, W.; Zhao, H.; Mei, K.; Meng, Y.; Ding, K.; et al. Exploring Concept Depth: How Large Language Models Acquire Knowledge and Concept at Different Layers? In Proceedings of the Proceedings of the 31st International Conference on Computational Linguistics, COLING 2025, Abu Dhabi, UAE, January 19-24, 2025; Rambow, O.; Wanner, L.; Apidianaki, M.; Al-Khalifa, H.; Eugenio, B.D.; Schockaert, S., Eds. Association for Computational Linguistics, 2025, pp. 558–573.
- Yu, Z.; Belinkov, Y.; Ananiadou, S. Back Attention: Understanding and Enhancing Multi-Hop Reasoning in Large Language Models. In Proceedings of the Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing; Christodoulopoulos, C.; Chakraborty, T.; Rose, C.; Peng, V., Eds., Suzhou, China, 2025; pp. 11268–11283. [CrossRef]
- Akter, M.S.; Shahriar, H.; Cuzzocrea, A.; Wu, F. Uncovering the Interpretation of Large Language Models. In Proceedings of the 48th IEEE Annual Computers, Software, and Applications Conference, COMPSAC 2024, Osaka, Japan, July 2-4, 2024; Shahriar, H.; Ohsaki, H.; Sharmin, M.; Towey, D.; Majumder, A.K.M.J.A.; Hori, Y.; Yang, J.; Takemoto, M.; Sakib, N.; Banno, R.; et al., Eds. IEEE, 2024, pp. 1057–1066. [CrossRef]
- Nikankin, Y.; Reusch, A.; Mueller, A.; Belinkov, Y. Arithmetic Without Algorithms: Language Models Solve Math with a Bag of Heuristics. In Proceedings of the The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025.
- Biran, E.; Gottesman, D.; Yang, S.; Geva, M.; Globerson, A. Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries. In Proceedings of the Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024; Al-Onaizan, Y.; Bansal, M.; Chen, Y., Eds. Association for Computational Linguistics, 2024, pp. 14113–14130. [CrossRef]
- Ye, T.; Xu, Z.; Li, Y.; Allen-Zhu, Z. Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process. In Proceedings of the The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025.
- Panigrahi, A.; Saunshi, N.; Zhao, H.; Arora, S. Task-Specific Skill Localization in Fine-tuned Language Models, 2023, [arXiv:cs.CL/2302.06600].
- Thilak, V.; Littwin, E.; Zhai, S.; Saremi, O.; Paiss, R.; Susskind, J.M. The slingshot mechanism: An empirical study of adaptive optimizers and the Grokking Phenomenon. In Proceedings of the Has it Trained Yet? NeurIPS 2022 Workshop, 2022.
- Wang, B.; Yue, X.; Su, Y.; Sun, H. Grokking of implicit reasoning in transformers: A mechanistic journey to the edge of generalization. Advances in Neural Information Processing Systems 2024, 37, 95238–95265.
- Lin, J.; Tang, J.; Tang, H.; Yang, S.; Chen, W.M.; Wang, W.C.; Xiao, G.; Dang, X.; Gan, C.; Han, S. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of machine learning and systems 2024, 6, 87–100.
- Sun, M.; Chen, X.; Kolter, J.Z.; Liu, Z. Massive activations in large language models. arXiv preprint arXiv:2402.17762 2024.
1. Here, $h^l \in \mathbb{R}^{T \times d}$ represents the token-wise residual stream state of the input sequence at layer $l$, with $T$ tokens and hidden dimension $d$.
2. For simplicity and clarity in our mechanistic analysis, we omit Layer Normalization (LayerNorm or RMSNorm) terms from these equations. While crucial for training stability, normalization operations are often abstracted away in high-level interpretability studies to focus on the additive composition of features.
3. While we use the standard formulation above to keep notation compact, it is important to note that many modern LLMs employ gated variants such as SwiGLU [302]. These variants introduce an additional gating matrix $W_{\text{gate}}$ and combine an element-wise gate with the in-projection before the final output: $\mathrm{FFN}(h^l) = \big(\mathrm{Swish}(h^l W_{\text{gate}}) \odot h^l W_{\text{in}}\big)\, W_{\text{out}}$. For the sake of generality, we present the standard FFN formulation here.
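To make the gated variant in footnote 3 concrete, here is a minimal PyTorch sketch of a SwiGLU-style feed-forward block. The class name `SwiGLUFFN` and the toy dimensions (`d_model=768`, `d_ffn=2048`) are our own illustrative choices, not values from any model discussed above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated FFN in the style of SwiGLU: a Swish-activated gate branch is
    multiplied element-wise with the in-projection ("key") branch before
    the out-projection ("value") matrix."""

    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ffn, bias=False)  # additional gating matrix W_gate
        self.w_in = nn.Linear(d_model, d_ffn, bias=False)    # in-projection (key) matrix W_in
        self.w_out = nn.Linear(d_ffn, d_model, bias=False)   # out-projection (value) matrix W_out

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # FFN(h) = (Swish(h W_gate) ⊙ (h W_in)) W_out; F.silu is Swish with beta = 1
        return self.w_out(F.silu(self.w_gate(h)) * self.w_in(h))

# Toy usage, following the (T tokens, hidden dimension d) convention above.
h = torch.randn(16, 768)              # residual stream state for one sequence
print(SwiGLUFFN(768, 2048)(h).shape)  # torch.Size([16, 768])
```

Note that $W_{\text{in}}$ and $W_{\text{out}}$ retain their key/value reading from the standard formulation; only the Swish-activated gate branch is new.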
| Object | Notation | Shape |
|---|---|---|
| Token Embedding | Embedding Matrix $W_E$ | $\mathbb{R}^{\lvert\mathcal{V}\rvert \times d}$ |
| | Token $i$ Embedding (Input) $x_i$ | $\mathbb{R}^{d}$ |
| Residual Stream | Residual Stream State $h^l$ | $\mathbb{R}^{T \times d}$ |
| | Intermediate State (Post-Attn) $h^{l,\text{mid}}$ | $\mathbb{R}^{T \times d}$ |
| MHA | Q, K, V, O Weight Matrices $W_Q, W_K, W_V, W_O$ | $\mathbb{R}^{d \times d_{\text{head}}}$ / $\mathbb{R}^{d_{\text{head}} \times d}$ |
| | Attention Score Matrix $A^{l,h}$ | $\mathbb{R}^{T \times T}$ |
| | Head Output $z^{l,h}$ | $\mathbb{R}^{T \times d}$ |
| | Block Output $a^{l}$ | $\mathbb{R}^{T \times d}$ |
| FFN | In Projection (Key) Matrix $W_{\text{in}}$ | $\mathbb{R}^{d \times d_{\text{ffn}}}$ |
| | Out Projection (Value) Matrix $W_{\text{out}}$ | $\mathbb{R}^{d_{\text{ffn}} \times d}$ |
| | Block Output $m^{l}$ | $\mathbb{R}^{T \times d}$ |
| Neuron | Neuron Activation State $n^{l}$ | $\mathbb{R}^{T \times d_{\text{ffn}}}$ |
| | $j$-th Neuron Activation $n^{l}_{j}$ | (Scalar) |
| | $j$-th Neuron Key Weight $w^{\text{in}}_{j}$ | $\mathbb{R}^{d}$ |
| | $j$-th Neuron Value Weight $w^{\text{out}}_{j}$ | $\mathbb{R}^{d}$ |
| SAE Feature | Feature Activation State $f^{l}$ | $\mathbb{R}^{T \times d_{\text{SAE}}}$ |
| | $j$-th Feature Activation $f^{l}_{j}$ | (Scalar) |
| | $j$-th Feature $d_{j}$ | $\mathbb{R}^{d}$ |
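As a sanity check on the shapes above, the following minimal sketch (ours, with toy dimensions and random weights; the SAE rows are omitted) traces a simplified single-head transformer block and prints the shape of each object:

```python
import torch

# Toy dimensions, chosen for illustration only.
T, d, d_head, d_ffn = 16, 64, 16, 256

h = torch.randn(T, d)                                # residual stream state h^l
W_Q, W_K, W_V = (torch.randn(d, d_head) for _ in range(3))
W_O = torch.randn(d_head, d)

Q, K, V = h @ W_Q, h @ W_K, h @ W_V
A = torch.softmax(Q @ K.T / d_head ** 0.5, dim=-1)   # attention score matrix, (T, T)
z = (A @ V) @ W_O                                    # head output, (T, d)
h_mid = h + z                                        # intermediate state (post-attn)

W_in, W_out = torch.randn(d, d_ffn), torch.randn(d_ffn, d)
n = torch.relu(h_mid @ W_in)                         # neuron activation state, (T, d_ffn)
m = n @ W_out                                        # FFN block output, (T, d)
h_next = h_mid + m                                   # residual stream state at layer l+1

for name, t in [("h^l", h), ("A", A), ("z", z), ("h_mid", h_mid),
                ("n", n), ("m", m), ("h^{l+1}", h_next)]:
    print(f"{name}: {tuple(t.shape)}")
```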
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
