Proceedings of ILIAD

The Proceedings of ILIAD publishes research on theoretical AI alignment presented at the annual ILIAD conference in Berkeley, California.

Proceedings of ILIAD 2: ODYSSEY (2025)

Submissions are now open. Deadline: June 25.

Proceedings of ILIAD (2024)

"An Approach to Giving In to Threats Without Incentivizing Them"
Mikhail Samin
Abstract:
We show how Logical Decision Theory (LDT) can effectively handle threats, ultimatums, and commitments in decision-making scenarios, incentivizing cooperation and fair outcomes. By employing the proposed strategy, LDT agents can often give in to threats (losing far less utility than if they never gave in) without making themselves exploitable or incentivizing others to make threats. We illustrate these principles through well-known game theory examples and offer ideas on how these strategies can be applied to broader domains. These ideas contribute to understanding how rational agents could achieve cooperation even when their notions of fair splits of gains differ, offering valuable implications for problems related to commitment races and s-risks.
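To make the incentive logic concrete, here is a minimal toy sketch in Python (our illustration under assumed payoffs, not the paper's exact strategy): the target gives in with probability just low enough that issuing the threat has negative expected value for the threatener.

    import random

    # Toy payoffs (assumptions for illustration): the threatener gains
    # `gain` if the target gives in, and pays `cost` to carry out the
    # threat if the target refuses; not threatening pays 0.

    def compliance_probability(gain, cost):
        # Give in with probability p < cost / (gain + cost), so the
        # threatener's expected payoff p*gain - (1 - p)*cost stays
        # below 0: making the threat is never worth it.
        return 0.99 * cost / (gain + cost)

    def respond_to_threat(gain, cost):
        p = compliance_probability(gain, cost)
        return "give in" if random.random() < p else "refuse"

    p = compliance_probability(gain=10, cost=5)   # ~0.33
    threatener_ev = p * 10 - (1 - p) * 5          # ~ -0.05 < 0

Here the target gives in about a third of the time, losing far less in expectation than if it always refused, while the threatener still strictly prefers never threatening.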
"Factored Space Models: Towards Causality Between Levels of Abstraction Extended Abstract"
Scott Garrabrant, Matthias G. Mayer, Magdalena Wache, Leon Lang, Sam Eisenstat, and Holger Dell
Abstract:
Causality plays an important role in understanding intelligent behavior, and there is a wealth of literature on mathematical models for causality, most of which is focused on causal graphs. Causal graphs are a powerful tool for a wide range of applications, in particular when the relevant variables are known and at the same level of abstraction. However, the given variables can also be unstructured data, like pixels of an image, and the given variables may be arbitrary functions of the latent causal variables, such as the positions of objects in the image. Moreover, the causal variables may form a hierarchy of abstractions, in which the macro-level variables are deterministic functions of the micro-level variables. Causal graphs are limited when it comes to modeling this kind of situation. In the presence of deterministic relationships there is generally no causal graph that satisfies both the Markov condition and the faithfulness condition. We introduce factored space models as an alternative to causal graphs which naturally represent both probabilistic and deterministic relationships at all levels of abstraction. Moreover, we introduce structural independence and establish that it is equivalent to statistical independence in every distribution that factorizes over the factored space. This theorem generalizes the classical soundness and completeness theorem for d-separation.
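A standard illustration of the determinism problem (our example, not drawn from the paper): let $X, Y \sim \operatorname{Bernoulli}(1/2)$ be independent and set

$$ Z = X \oplus Y . $$

Every pair of these variables is independent, yet all three are jointly dependent. The graph $X \to Z \leftarrow Y$ satisfies the Markov condition, but $X$ and $Z$ are d-connected (adjacent) while statistically independent, so faithfulness fails; the same tension arises for every DAG over $\{X, Y, Z\}$.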
"Geometric Utilitarianism"
John Little
Abstract:
The foundational von Neumann-Morgenstern utility theorem established expected utility maximization as the mathematical basis for individual rationality. Extending this seminal framework, this paper develops geometric utility maximization as a corresponding basis for group rationality. Building upon Garrabrant's work on geometric rationality, we prove that any Pareto optimal outcome can be achieved by maximizing a geometric weighted average of individual utilities. This framework advances beyond classical Harsanyi utility aggregation, naturally favoring compromise solutions over extremes, and these compromises shift continuously as geometric weights shift. We demonstrate how geometric utilitarianism resolves fundamental limitations of classic utilitarianism in multi-agent scenarios, particularly regarding fairness, consent, and externalities. The resulting theory provides a rigorous mathematical framework for ethical decision-making in complex multi-stakeholder environments, with direct applications to artificial intelligence alignment and social choice theory.
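A minimal sketch of the contrast on a toy feasible set (our illustration; the feasible set and names are assumptions, not the paper's): when one unit of utility is split between two agents, the weighted geometric mean has an interior maximizer that tracks the weights, while Harsanyi's weighted sum jumps to a corner.

    # Toy feasible set: {(u1, u2) : u1 + u2 = 1, u1, u2 >= 0}.

    def geometric_optimum(w1, w2):
        # argmax of u1**w1 * u2**w2 on the set above is the interior
        # point (w1, w2) / (w1 + w2): a compromise that shifts
        # continuously with the geometric weights.
        total = w1 + w2
        return (w1 / total, w2 / total)

    def harsanyi_optimum(w1, w2):
        # argmax of w1*u1 + w2*u2 on the same set is a corner: all
        # utility goes to whichever agent has the larger weight.
        return (1.0, 0.0) if w1 > w2 else (0.0, 1.0)

    print(geometric_optimum(0.6, 0.4))   # (0.6, 0.4) -- compromise
    print(harsanyi_optimum(0.6, 0.4))    # (1.0, 0.0) -- extreme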
"The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks"
Lucius Bushnaq, Stefan Heimersheim, Nicholas Goldowsky-Dill, Dan Braun, Jake Mendel, Kaarel Hänni, Avery Griffin, Jörn Stöhler, Magdalena Wache, and Marius Hobbhahn
Abstract:
Mechanistic interpretability aims to understand the behavior of neural networks by reverse-engineering their internal computations. However, current methods struggle to find clear interpretations of neural network activations because a decomposition of activations into computational features is missing. Individual neurons or model components do not cleanly correspond to distinct features or functions. We present a novel interpretability method that aims to overcome this limitation by transforming the activations of the network into a new basis — the Local Interaction Basis (LIB). LIB aims to identify computational features by removing irrelevant activations and interactions. Our method drops irrelevant activation directions and aligns the basis with the singular vectors of the Jacobian matrix between adjacent layers. It also scales features based on their importance for downstream computation, producing an interaction graph that shows all computationally-relevant features and interactions in a model. We evaluate the effectiveness of LIB on modular addition and CIFAR-10 models, finding that it identifies more computationally-relevant features that interact more sparsely, compared to principal component analysis. However, LIB does not yield substantial improvements in interpretability or interaction sparsity when applied to language models. We conclude that LIB is a promising theory-driven approach for analyzing neural networks, but in its current form is not applicable to large language models.
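A rough sketch of the basis-alignment step on a toy one-layer ReLU map (our simplification: it omits LIB's removal of irrelevant directions and its importance rescaling):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(8, 8))      # toy layer: x -> relu(W @ x)

    x0 = rng.normal(size=8)          # a reference activation
    # Jacobian of relu(W @ x) at x0: rows of W where the unit is
    # active, zero rows where it is not.
    J = W * (W @ x0 > 0)[:, None]

    # Align the earlier layer's basis with the right singular vectors
    # of the between-layer Jacobian; the singular values rank how
    # strongly each direction feeds the downstream computation.
    U, S, Vt = np.linalg.svd(J)
    x0_aligned = Vt @ x0             # activation in the aligned basis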
"Mathematical Models of Computation in Superposition"
Kaarel Hänni, Jake Mendel, Dmitry Vaintrob, and Lawrence Chan
Abstract:
Superposition — when a neural network represents more "features" than it has dimensions — seems to pose a serious challenge to mechanistically interpreting current AI systems. Existing theory work studies representational superposition, where superposition is only used when passing information through bottlenecks. In this work, we present mathematical models of computation in superposition, where superposition is actively helpful for efficiently accomplishing the task. We first construct a task of efficiently emulating a circuit that takes the AND of the $\binom{m}{2}$ pairs of each of $m$ features. We construct a 1-layer MLP that uses superposition to perform this task up to $\varepsilon$-error, where the network only requires $\tilde{O}(m^{2/3})$ neurons, even when the input features are themselves in superposition. We generalize this construction to arbitrary sparse boolean circuits of low depth, and then construct "error correction" layers that allow deep fully-connected networks of width $d$ to emulate circuits of width $\tilde{O}(d^{1.5})$ and any polynomial depth. We conclude by providing some potential applications of our work for interpreting neural networks that implement computation in superposition.
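To make the efficiency gap concrete (a back-of-envelope count of our own, not from the paper): with Boolean inputs a single ReLU neuron can compute one pairwise AND,

$$ \mathrm{AND}(x_i, x_j) = \mathrm{ReLU}(x_i + x_j - 1), \qquad x_i, x_j \in \{0, 1\}, $$

so a dedicated-neuron construction needs $\binom{m}{2} = \Theta(m^2)$ neurons for all pairs, whereas the superposed construction gets by with $\tilde{O}(m^{2/3})$.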
"Natural Latents: Latent Variables Stable Across Ontologies"
John Wentworth and David Lorell
Abstract:
Suppose two Bayesian agents each learn a generative model of the same environment. We will assume the two have converged on the predictive distribution (i.e., the distribution over some observables in the environment), but may have different generative models containing different latent variables. Under what conditions can one agent guarantee that their latents can be faithfully expressed in terms of the other agent's latents?

We give simple conditions under which such translation is guaranteed to be possible: the natural latent conditions. We also show that, absent further constraints, these are the most general conditions under which translatability is guaranteed.
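For two observables $X_1, X_2$, the natural latent conditions on a latent $\Lambda$ are, roughly (our paraphrase; see the paper for the precise approximate statements):

$$ \underbrace{X_1 \perp X_2 \mid \Lambda}_{\text{mediation}}, \qquad \underbrace{\Lambda \perp X_1 \mid X_2 \ \text{ and } \ \Lambda \perp X_2 \mid X_1}_{\text{redundancy}} , $$

i.e., $\Lambda$ screens off the observables from each other and is recoverable from either observable alone.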

"Representation Learning on a Random Lattice"
Aryeh Brill
Abstract:
Decomposing a deep neural network's learned representations into interpretable features could greatly enhance its safety and reliability. To better understand features, we adopt a geometric perspective, viewing them as a learned coordinate system for mapping an embedded data distribution. We motivate a model of a generic data distribution as a random lattice and analyze its properties using percolation theory. Learned features are categorized into context, component, and surface features. The model is qualitatively consistent with recent findings in mechanistic interpretability and suggests directions for future research.
"Singular leaning coefficients and efficiency in learning theory"
Miki Aoyagi
Abstract:
Singular learning models with non-positive-definite Fisher information matrices include neural networks, reduced-rank regression, Boltzmann machines, normal mixture models, and others. These models have been widely used in the development of learning machines. However, theoretical analysis is still in its early stages.

In this paper, we examine learning coefficients, which indicate the general learning efficiency of deep linear learning models and three-layer neural network models with ReLU units. Finally, we extend the results to include the case of the Softmax function.
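For background (standard singular learning theory in the sense of Watanabe, not a result of this paper): the learning coefficient $\lambda$ is the coefficient of $\log n$ in the asymptotic expansion of the Bayes free energy,

$$ F_n = n L_n(w_0) + \lambda \log n + O_p(\log \log n), $$

so a smaller $\lambda$ means more efficient learning; regular models have $\lambda = d/2$ for $d$ parameters, while singular models can attain strictly smaller values.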

"Understanding Trust"
Abram Demski, Norman Hsia, and Paul Rapoport
Abstract:
We examine conditions under which agents maintain their properties over time (the properties "tile"). Historically, tiling has been studied in the case of logical agents which must prove that their actions achieve some criteria [YH13]. However, this faced serious obstacles due to Löb's Theorem. Stuart Armstrong noted that this obstacle does not apply to agents who reason probabilistically [Arm13], but he did not provide a positive tiling result on that basis. The present work sketches basic tiling results for agents who make decisions using expected utility maximization, studying several versions of updateless decision theory.
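For reference, the obstacle in question is Löb's Theorem (a standard statement, not specific to this paper): for any sentence $P$,

$$ \vdash \Box(\Box P \rightarrow P) \rightarrow \Box P , $$

so a proof-based agent that could prove blanket trust in its successor ($\Box P \rightarrow P$ for every $P$) would thereby prove every $P$ outright. Probabilistic agents sidestep this because they need not treat a successor's conclusions as certainties.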