Designing for Adversaries

Analytic approaches to verification make assumptions that break in practice. Arkhai's research approach: train agents to cheat, then iterate mechanisms to stop them.

February 23, 2026 · Levi Rybalov

Eighteen Months and a Million Dollars: Part 6

Excerpts from the Whitepaper

Key Takeaways

  • Analytic approaches to verification-via-replication make assumptions that break down in practice
  • Much of the academic literature (and many public protocols it inspired) simplifies away real-world complexity: hardware heterogeneity, network topologies, latencies, failures, repeated games, collusion
  • Arkhai's research approach: train agents to cheat, then iterate mechanisms to stop them
  • Multi-agent RL trains attackers and defenders; reward-design ideas (including inverse RL) help search for incentives that elicit desired behavior
  • Honest about limitations: multi-agent training is not guaranteed to converge, and empirical robustness is not a proof
  • Three possible outcomes: defense works, arms race, or learning that a given mechanism class is insufficient under the threat model. All three are valuable.

Note: The approach described in this post originated as research, and most of it has not yet been implemented in our production systems.

Why analytic approaches fail

Mathematical proofs are reassuring, but the real world is unforgiving.

In our last post, we covered collateral markets: how the collateral multiplier solves the unknown-cost problem, and how a series of credible commitments handles different types of risk. The mechanism layer is in place.

But mechanisms need to survive contact with adversaries. How do you know a collateral scheme actually prevents cheating? How do you know a verification protocol can't be gamed by colluding nodes?

The traditional approach is to mathematically prove that honest behavior is rational and leads to Nash equilibrium. The problem is that analytic results depend on tractable models, and tractable models depend on simplifications.

In our reading, much of the academic literature on verification-via-replication (and the public protocols it has inspired) does not fully account for the operational complexity of real distributed computing networks, including:

  • different hardware configurations
  • network topologies
  • variable latencies
  • node failures
  • repeated games
  • collusion

Many papers and protocols make assumptions about the environment, action spaces, or failure models that, once relaxed, weaken or eliminate the original theoretical guarantees.

This isn't a criticism of the research. It's the nature of the problem. Real distributed systems violate simplifying assumptions constantly.

For this reason, Arkhai explores an alternative approach that forgoes analytic guarantees in favor of empirical evaluation.

Game-theoretic white-hat hacking

If you can't prove your mechanism is secure, you can test it.

Arkhai's research approach to verification-via-replication is to train agents to maximize their utility, including cheating and/or colluding if necessary. Once these agents are strong enough to find real weaknesses, mechanisms can be iterated and evaluated against the strategies the agents actually discover.

In practice, this starts in simulation. You build a digital twin of the protocol environment and let agents explore the strategy space safely before you ever trust the results in a live network.

Operationally, this becomes a loop:

  1. Train the best attackers you can.
  2. Observe the strategies they discover.
  3. Update the mechanism (incentives, penalties, verification rules).
  4. Retrain and repeat.

This inverts the typical workflow. Instead of designing a mechanism and hoping it's robust, you start by training the attackers. You let them find the weaknesses. You watch how they exploit the system. Then you design countermeasures and test whether the attackers can adapt.

The approach treats protocol design as adversarial competition. Red team versus blue team. The red team is trained to find exploits. The blue team is trained to close them. The protocol improves through iteration.

This requires the same agent-based primitives we described in our first post: environments (the protocol state), actions (cheating strategies, honest behavior, coordination with other nodes), transitions (how the protocol responds), and rewards (the utility function the attacker is trying to maximize). States, actions, transitions, and rewards are treated as first-class citizens in the architecture.
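To make those four primitives concrete, here is a minimal sketch of what an attacker-facing environment might look like. Every name in it (`State`, `transition`, `reward`, the payment and slashing model) is a hypothetical assumption for illustration, not Arkhai's API.

```python
import random
from dataclasses import dataclass

# Hypothetical sketch of the four primitives for an attacker agent:
# state (protocol view), actions, transitions, rewards.

@dataclass
class State:
    collateral: float  # stake the node has posted
    jobs_done: int

ACTIONS = ("honest", "cheat")

def transition(state, action, detection_rate=0.5, rng=random):
    """How the protocol responds: cheats are sometimes caught and slashed."""
    if action == "cheat" and rng.random() < detection_rate:
        return State(collateral=0.0, jobs_done=state.jobs_done)  # slashed
    return State(state.collateral, state.jobs_done + 1)

def reward(old, new, action):
    """Attacker utility: payment per completed job, minus slashed stake."""
    payment = 1.5 if action == "cheat" else 1.0  # cheating skips real work
    slashed = old.collateral - new.collateral
    return payment * (new.jobs_done - old.jobs_done) - slashed
```

With the primitives explicit like this, "training an attacker" just means maximizing `reward` over trajectories through `transition`.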

Multi-agent inverse reinforcement learning

The attacker/defender training loop draws heavily on multi-agent reinforcement learning, where each agent's learning changes the environment for the others.

Standard reinforcement learning asks: given a reward structure, what actions maximize reward? Inverse reinforcement learning asks the opposite: given observed behavior (typically demonstrations), what reward structure could have produced it?

Multi-agent inverse reinforcement learning extends this to groups of agents. It finds reward structures that elicit particular actions from collections of agents: for example, not cheating and not colluding in verification-via-replication-based distributed computing networks.

This is mechanism design through machine learning. Rather than deriving incentive structures analytically, you explore them empirically through adversarial testing. Trained attackers show you where the vulnerabilities are. The reward-design loop suggests what incentive changes might close those vulnerabilities.
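One crude way to picture that search, far simpler than actual multi-agent IRL: enumerate candidate incentive parameters and keep only those under which no agent's best response is to cheat or collude. The payoff model and parameter names below are invented for illustration; real reward-learning infers reward structures from observed behavior rather than grid-searching a hand-written utility.

```python
from itertools import product

# Toy "mechanism design via search": find (penalty, collusion_fine)
# settings under which every agent's best response is honest play.
# The payoff numbers are invented for illustration only.

STRATEGIES = ("honest", "cheat", "collude")

def payoff(strategy, penalty, collusion_fine, detection=0.5):
    if strategy == "honest":
        return 1.0                                        # base payment
    if strategy == "cheat":
        return 1.5 - detection * penalty                  # solo cheating
    return 2.0 - detection * (penalty + collusion_fine)   # colluding

def honest_is_best(penalty, collusion_fine):
    best = max(STRATEGIES, key=lambda s: payoff(s, penalty, collusion_fine))
    return best == "honest"

# Keep the incentive settings that make honesty the best response.
candidates = [(p, f)
              for p, f in product([0.0, 1.0, 2.0], repeat=2)
              if honest_is_best(p, f)]
print(candidates)
```

The interesting part is the inversion: instead of asking what agents do under fixed incentives, you ask which incentives produce the behavior you want.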

These tools are not a silver bullet, but they are a promising way to explore incentive landscapes in settings where analytic modeling breaks down. And the space of incentive structures goes well beyond verifiable computing. Coalition formation, resource markets, negotiation protocols: anywhere agents have overlapping interests (sometimes cooperative, sometimes competitive, sometimes both), multi-agent RL and reward-learning can be useful.

Convergence

From a theoretical perspective, multi-agent learning dynamics in such environments are not guaranteed to converge to stable equilibria. There are convergence results in restricted settings, but there is no general guarantee that adversarial training will yield mechanisms robust to manipulation.

Multi-agent training can be unstable. Agents adapt to each other. The best strategy for attacker A depends on what defender B does, which depends on what A does, which depends on B. The feedback loops can cycle without settling. Convergence proofs that work for single-agent RL do not extend straightforwardly to multi-agent settings, in part because each agent faces a moving target.
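The moving-target problem is easy to demonstrate with a classic example: in matching pennies, alternating best responses never settle. The toy simulation below (pure illustration, unrelated to any Arkhai system) cycles forever instead of converging.

```python
# Best-response dynamics in matching pennies: the matcher wants to play
# the same side as the mismatcher; the mismatcher wants to differ.
# Each round one player best-responds to the other's last move, so pure
# strategies cycle H, T, H, T, ... instead of converging to a fixed point.

def matcher_best_response(opponent):       # wants to match
    return opponent

def mismatcher_best_response(opponent):    # wants to mismatch
    return "T" if opponent == "H" else "H"

def run(rounds=8):
    a, b = "H", "H"                        # the matcher starts matched...
    history = []
    for _ in range(rounds):
        b = mismatcher_best_response(a)    # ...so the mismatcher escapes
        a = matcher_best_response(b)       # and the matcher chases
        history.append((a, b))
    return history

print(run())
```

Each agent's "optimal" strategy is only optimal against the other's last move, which the move itself invalidates; that is exactly the feedback loop described above.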

This doesn't mean the approach is useless. We're not claiming to have solved adversarial mechanism design. We're claiming that this method provides empirical evidence about mechanism robustness, with clear acknowledgment that the evidence is not a guarantee.

Three outcomes are possible.

First, the adversarial training finds attacks, the mechanism updates successfully defend against them, and further adversarial training fails to find new attacks. This is the best case: you have empirical confidence that the mechanism is robust against the class of attacks your training can discover.

Second, the adversarial training and defensive updates enter an arms race. Each side keeps adapting. This is less satisfying, but still useful: you've learned that the mechanism requires ongoing maintenance, and you have a process for that maintenance.

Third, the adversarial training finds attacks that can't be defended against within the mechanism class being explored. That doesn't imply that no mechanism exists; it suggests that the current assumptions, primitives, or threat model are insufficient. This outcome is still valuable: it tells you to change the design space (different verification, different trust assumptions, different market structure), rather than endlessly tuning parameters.

All three outcomes produce useful knowledge. That's the advantage of empirical methods: they don't just tell you whether you succeeded, they tell you what kind of problem you're dealing with.

What this means for you

If you're building protocols that need adversarial resistance, consider what guarantees you can actually provide. Formal proofs are reassuring, but they rest on assumptions that may not hold in production. Empirical testing doesn't give you proofs, but it gives you evidence from realistic conditions.

If you're integrating with Arkhai's infrastructure, expect adversarial evaluation to be part of how mechanisms are validated over time. Formal analysis matters, but the goal is to complement it with empirical evidence from simulated adversaries.

If you're researching multi-agent systems, multi-agent RL and reward-learning tools are underutilized in protocol design. Most mechanism design still happens analytically. There's room for empirical methods to become standard practice.

Next in the series

We've covered the full mechanism stack: Alkahest for exchange and commitment, verification for trust, collateral markets for economic guarantees, and adversarial design for robustness testing.

In our next post, we'll step back from mechanisms and look at economics. Where does idle compute fit into this picture? What happens when you tokenize latent capacity and create futures markets for work that has no buyer yet?


Arkhai is building machine-actionable marketplace infrastructure. If you're working on problems that intersect with compute markets, agent coordination, or decentralized infrastructure, we'd like to hear from you.