How to Automatically Attribute Failures in LLM Multi-Agent Systems Using the Who&When Dataset


Introduction

When your LLM-powered multi-agent system fails on a task, you're not just left with a broken output — you're left with a headache. Which agent made the mistake? At what step did things go wrong? Manual log crawling feels like hunting for a single typo in a novel. Fortunately, researchers from Penn State University and Duke University, in collaboration with Google DeepMind, the University of Washington, Meta, Nanyang Technological University, and Oregon State University, have introduced a structured solution: automated failure attribution. Their work, accepted as a Spotlight presentation at ICML 2025, provides the first benchmark dataset (Who&When) and several evaluation methods to pinpoint the root cause of failures. This guide walks you through applying these tools to your own multi-agent systems, saving you hours of frustration.


What You Need

  - Python 3.8+ with git and pip installed
  - An API key for a pre-trained LLM (the attribution methods query one)
  - The Hugging Face datasets library (installed via the repository's requirements.txt)
  - Failure logs to analyze — either the Who&When dataset or traces from your own multi-agent system

Step-by-Step Guide

Step 1: Understand the Task of Failure Attribution

Before diving into code, grasp the core concept. In LLM multi-agent systems, multiple agents collaborate (e.g., via conversation or tool use) to solve a problem. A failure occurs when the final output is incorrect or incomplete. Failure attribution answers two questions: which agent caused the failure and at which point in the interaction (i.e., which timestamp or turn). The Who&When dataset simulates such failures with ground-truth labels, so you can evaluate the accuracy of your attribution method.
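As a concrete sketch of those two outputs (the field names below are illustrative, not the exact Who&When schema):

```python
# A failed multi-agent trace, reduced to (step, agent, message) tuples.
# Field names are illustrative, not the exact Who&When schema.
trace = [
    (0, "Planner", "Break the task into two subtasks."),
    (1, "Coder", "def area(r): return 3.14 * r"),  # bug: missing r ** 2
    (2, "Reviewer", "Looks correct, submitting."),
]

# Failure attribution assigns two labels to the whole trace:
# 'who' names the responsible agent, 'when' gives the failing step.
attribution = {"who": "Coder", "when": 1}

# The labels are consistent: the blamed agent authored the blamed step.
assert trace[attribution["when"]][1] == attribution["who"]
```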

Step 2: Clone the Repository and Set Up the Environment

  1. Open a terminal and run:
    git clone https://github.com/mingyin1/Agents_Failure_Attribution.git
  2. Navigate to the directory:
    cd Agents_Failure_Attribution
  3. Create a virtual environment (recommended):
    python -m venv venv
    source venv/bin/activate # On Windows: venv\Scripts\activate
  4. Install dependencies:
    pip install -r requirements.txt

Step 3: Download the Who&When Dataset

The dataset is hosted on Hugging Face. Run the provided download script or use the Hugging Face datasets library:

from datasets import load_dataset
dataset = load_dataset("Kevin355/Who_and_When")

Alternatively, visit the dataset page and download the files manually. Place them in a data/ folder within the repository.

Step 4: Understand the Dataset Structure

The dataset contains multi-agent interaction logs, each labeled with:

  - the agent responsible for the failure (the who),
  - the step or turn at which the decisive error occurred (the when), and
  - the original task along with whether the final answer was correct.

Familiarize yourself with the format by examining a sample: dataset['train'][0] in Python.
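A quick way to do that inspection without hard-coding field names — the record below is a mock stand-in for dataset['train'][0], since the real Who&When fields may differ:

```python
def preview_record(record, width=80):
    """Return one truncated 'key: value' line per field of a dataset record."""
    lines = []
    for key, value in record.items():
        text = str(value)
        if len(text) > width:
            text = text[:width] + "..."
        lines.append(f"{key}: {text}")
    return lines

# Mock record standing in for dataset['train'][0]; real fields may differ.
record = {
    "question": "What is the area of a circle with radius 2?",
    "history": [{"agent": "Planner", "content": "Use area = pi * r ** 2."}],
    "mistake_agent": "Coder",
    "mistake_step": 3,
}
for line in preview_record(record):
    print(line)
```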

Step 5: Choose an Attribution Method

The paper introduces several automated methods. Start with the trace-based (all-at-once) method, which feeds the entire interaction trace to a pre-trained LLM and asks it to predict the responsible agent and step in a single pass. More advanced options include:

  - a step-by-step method that replays the conversation turn by turn, asking the LLM at each step whether the decisive error has occurred yet, and
  - a binary-search method that recursively narrows the trace to the failing segment, trading extra LLM calls for better localization on long logs.

The repository includes scripts for each. For your first run, use the default trace-based approach.
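The trace-based idea can be sketched as a single prompt that serializes the whole conversation and asks the LLM for both labels at once. The prompt wording below is an assumption for illustration, not the repository's exact template:

```python
def build_attribution_prompt(trace):
    """Serialize a failed trace into one prompt asking an LLM to name the
    responsible agent and step — a sketch of the trace-based method."""
    lines = ["The following multi-agent conversation ended in failure."]
    for step, turn in enumerate(trace):
        lines.append(f"[step {step}] {turn['agent']}: {turn['content']}")
    lines.append(
        "Which agent made the decisive mistake, and at which step? "
        "Answer exactly as: agent=<name>, step=<number>."
    )
    return "\n".join(lines)

trace = [
    {"agent": "Planner", "content": "First compute the radius from the diameter."},
    {"agent": "Coder", "content": "radius = diameter * 2"},  # should be / 2
]
prompt = build_attribution_prompt(trace)
# Send `prompt` to your LLM of choice and parse the agent=/step= reply.
print(prompt)
```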

Step 6: Run Attribution on a Sample Failure

Execute the provided evaluation script:

python run_attribution.py --dataset_path ./data/Who_and_When --method trace_based --split test

This analyzes a batch of test cases and prints predictions alongside ground truth, logging separate accuracies for the who and when labels.

Step 7: Interpret the Results

Check the output summary. A high who accuracy (e.g., above 80%) indicates the method reliably identifies the failing agent; a low when accuracy means it struggles to pinpoint the exact moment. Examine the misattributions — does the model blame an agent too early or too late in the trace? The paper reports baseline metrics (e.g., random guessing gives roughly 25% who accuracy in a four-agent system), so compare your numbers against those.
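The two accuracies the script reports can be computed separately, as in this small sketch (the (agent, step) prediction format here is an assumption):

```python
def attribution_accuracy(predictions, labels):
    """Separate 'who' and 'when' accuracies over (agent, step) pairs."""
    n = len(labels)
    who = sum(p[0] == g[0] for p, g in zip(predictions, labels)) / n
    when = sum(p[1] == g[1] for p, g in zip(predictions, labels)) / n
    return who, when

preds = [("Coder", 1), ("Planner", 0), ("Coder", 4)]
truth = [("Coder", 1), ("Coder", 2), ("Coder", 3)]
who_acc, when_acc = attribution_accuracy(preds, truth)
print(f"who: {who_acc:.2f}, when: {when_acc:.2f}")  # who: 0.67, when: 0.33
```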

Step 8: Apply to Your Own Multi-Agent System

To use this on your custom system, log interactions in the same format as the dataset: JSON records with keys for agent names, message content, timestamps, and final success/failure. Then modify the attribution scripts to accept your data; the trace-based method can be adapted by feeding your logs to the LLM with a similar prompt template.
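A converter for your own logs might look like the sketch below. The output key names (question, is_correct, history) are assumptions — verify them against the actual Who&When files before wiring this into the scripts:

```python
import json

def to_attribution_record(run_log):
    """Reshape a custom run log into a Who&When-style record.
    Output key names are assumptions — verify against the real files."""
    return {
        "question": run_log.get("task", ""),
        "is_correct": run_log["succeeded"],
        "history": [
            {"step": i, "agent": turn["name"], "content": turn["text"]}
            for i, turn in enumerate(run_log["turns"])
        ],
    }

run_log = {
    "task": "Compute the circle's area.",
    "succeeded": False,
    "turns": [
        {"name": "Planner", "text": "Use area = pi * r ** 2."},
        {"name": "Coder", "text": "area = 3.14 * r"},
    ],
}
print(json.dumps(to_attribution_record(run_log), indent=2))
```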

Tips for Success

  - Attribution quality depends heavily on the backbone LLM; stronger models localize failures more reliably.
  - Expect when accuracy to lag behind who accuracy — pinpointing the exact turn is the harder half of the task.
  - Validate your pipeline on the Who&When ground-truth labels before trusting it on your own unlabeled failures.
  - Keep traces complete: truncated logs deprive the attribution LLM of the context it needs to assign blame.
