How to Automatically Attribute Failures in LLM Multi-Agent Systems Using the Who&When Dataset


Introduction

When your LLM-powered multi-agent system fails on a task, you're not just left with a broken output — you're left with a headache. Which agent made the mistake? At what step did things go wrong? Manual log crawling feels like hunting for a single typo in a novel. Fortunately, researchers from Penn State University and Duke University, in collaboration with Google DeepMind, the University of Washington, Meta, Nanyang Technological University, and Oregon State University, have introduced a structured solution: automated failure attribution. Their work, accepted as a Spotlight presentation at ICML 2025, provides the first benchmark dataset (Who&When) and several evaluation methods to pinpoint the root cause of failures. This guide walks you through applying these tools to your own multi-agent systems, saving you hours of frustration.


What You Need

  - Python 3.8+ with git and pip installed
  - An API key for a pre-trained LLM (the attribution methods query one)
  - The Hugging Face datasets library (installed via the repository's requirements.txt)
  - Failure logs to analyze — either the Who&When dataset or traces from your own multi-agent system

Step-by-Step Guide

Step 1: Understand the Task of Failure Attribution

Before diving into code, grasp the core concept. In LLM multi-agent systems, multiple agents collaborate (e.g., via conversation or tool use) to solve a problem. A failure occurs when the final output is incorrect or incomplete. Failure attribution answers two questions: which agent caused the failure and at which point in the interaction (i.e., which timestamp or turn). The Who&When dataset simulates such failures with ground-truth labels, so you can evaluate the accuracy of your attribution method.
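As a concrete sketch of those two outputs (the field names below are illustrative, not the exact Who&When schema):

```python
# A failed multi-agent trace, reduced to (step, agent, message) tuples.
# Field names are illustrative, not the exact Who&When schema.
trace = [
    (0, "Planner", "Break the task into two subtasks."),
    (1, "Coder", "def area(r): return 3.14 * r"),  # bug: missing r ** 2
    (2, "Reviewer", "Looks correct, submitting."),
]

# Failure attribution assigns two labels to the whole trace:
# 'who' names the responsible agent, 'when' gives the failing step.
attribution = {"who": "Coder", "when": 1}

# The labels are consistent: the blamed agent authored the blamed step.
assert trace[attribution["when"]][1] == attribution["who"]
```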

Step 2: Clone the Repository and Set Up the Environment

  1. Open a terminal and run:
    git clone https://github.com/mingyin1/Agents_Failure_Attribution.git
  2. Navigate to the directory:
    cd Agents_Failure_Attribution
  3. Create a virtual environment (recommended):
    python -m venv venv
    source venv/bin/activate # On Windows: venv\Scripts\activate
  4. Install dependencies:
    pip install -r requirements.txt

Step 3: Download the Who&When Dataset

The dataset is hosted on Hugging Face. Run the provided download script or use the Hugging Face datasets library:

from datasets import load_dataset
dataset = load_dataset("Kevin355/Who_and_When")

Alternatively, visit the dataset page and download the files manually. Place them in a data/ folder within the repository.

Step 4: Understand the Dataset Structure

The dataset contains multi-agent interaction logs, each labeled with:

  - the agent responsible for the failure (the who),
  - the step or turn at which the decisive error occurred (the when), and
  - the original task along with whether the final answer was correct.

Familiarize yourself with the format by examining a sample: dataset['train'][0] in Python.
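A quick way to do that inspection without hard-coding field names — the record below is a mock stand-in for dataset['train'][0], since the real Who&When fields may differ:

```python
def preview_record(record, width=80):
    """Return one truncated 'key: value' line per field of a dataset record."""
    lines = []
    for key, value in record.items():
        text = str(value)
        if len(text) > width:
            text = text[:width] + "..."
        lines.append(f"{key}: {text}")
    return lines

# Mock record standing in for dataset['train'][0]; real fields may differ.
record = {
    "question": "What is the area of a circle with radius 2?",
    "history": [{"agent": "Planner", "content": "Use area = pi * r ** 2."}],
    "mistake_agent": "Coder",
    "mistake_step": 3,
}
for line in preview_record(record):
    print(line)
```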

Step 5: Choose an Attribution Method

The paper introduces several automated methods. Start with the trace-based (all-at-once) method, which feeds the entire interaction trace to a pre-trained LLM and asks it to predict the responsible agent and step in a single pass. More advanced options include:

  - a step-by-step method that replays the conversation turn by turn, asking the LLM at each step whether the decisive error has occurred yet, and
  - a binary-search method that recursively narrows the trace to the failing segment, trading extra LLM calls for better localization on long logs.

The repository includes scripts for each. For your first run, use the default trace-based approach.
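The trace-based idea can be sketched as a single prompt that serializes the whole conversation and asks the LLM for both labels at once. The prompt wording below is an assumption for illustration, not the repository's exact template:

```python
def build_attribution_prompt(trace):
    """Serialize a failed trace into one prompt asking an LLM to name the
    responsible agent and step — a sketch of the trace-based method."""
    lines = ["The following multi-agent conversation ended in failure."]
    for step, turn in enumerate(trace):
        lines.append(f"[step {step}] {turn['agent']}: {turn['content']}")
    lines.append(
        "Which agent made the decisive mistake, and at which step? "
        "Answer exactly as: agent=<name>, step=<number>."
    )
    return "\n".join(lines)

trace = [
    {"agent": "Planner", "content": "First compute the radius from the diameter."},
    {"agent": "Coder", "content": "radius = diameter * 2"},  # should be / 2
]
prompt = build_attribution_prompt(trace)
# Send `prompt` to your LLM of choice and parse the agent=/step= reply.
print(prompt)
```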

Step 6: Run Attribution on a Sample Failure

Execute the provided evaluation script:

python run_attribution.py --dataset_path ./data/Who_and_When --method trace_based --split test

This analyzes a batch of test cases and prints predictions alongside ground truth, logging separate accuracies for the who and when labels.

Step 7: Interpret the Results

Check the output summary. A high who accuracy (e.g., above 80%) indicates the method reliably identifies the failing agent; a low when accuracy means it struggles to pinpoint the exact moment. Examine the misattributions — does the model blame an agent too early or too late in the trace? The paper reports baseline metrics (e.g., random guessing gives roughly 25% who accuracy in a four-agent system), so compare your numbers against those.
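The two accuracies the script reports can be computed separately, as in this small sketch (the (agent, step) prediction format here is an assumption):

```python
def attribution_accuracy(predictions, labels):
    """Separate 'who' and 'when' accuracies over (agent, step) pairs."""
    n = len(labels)
    who = sum(p[0] == g[0] for p, g in zip(predictions, labels)) / n
    when = sum(p[1] == g[1] for p, g in zip(predictions, labels)) / n
    return who, when

preds = [("Coder", 1), ("Planner", 0), ("Coder", 4)]
truth = [("Coder", 1), ("Coder", 2), ("Coder", 3)]
who_acc, when_acc = attribution_accuracy(preds, truth)
print(f"who: {who_acc:.2f}, when: {when_acc:.2f}")  # who: 0.67, when: 0.33
```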

Step 8: Apply to Your Own Multi-Agent System

To use this on your custom system, log interactions in the same format as the dataset: JSON records with keys for agent names, message content, timestamps, and final success/failure. Then modify the attribution scripts to accept your data; the trace-based method can be adapted by feeding your logs to the LLM with a similar prompt template.
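A converter for your own logs might look like the sketch below. The output key names (question, is_correct, history) are assumptions — verify them against the actual Who&When files before wiring this into the scripts:

```python
import json

def to_attribution_record(run_log):
    """Reshape a custom run log into a Who&When-style record.
    Output key names are assumptions — verify against the real files."""
    return {
        "question": run_log.get("task", ""),
        "is_correct": run_log["succeeded"],
        "history": [
            {"step": i, "agent": turn["name"], "content": turn["text"]}
            for i, turn in enumerate(run_log["turns"])
        ],
    }

run_log = {
    "task": "Compute the circle's area.",
    "succeeded": False,
    "turns": [
        {"name": "Planner", "text": "Use area = pi * r ** 2."},
        {"name": "Coder", "text": "area = 3.14 * r"},
    ],
}
print(json.dumps(to_attribution_record(run_log), indent=2))
```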

Tips for Success

  - Attribution quality depends heavily on the backbone LLM; stronger models localize failures more reliably.
  - Expect when accuracy to lag behind who accuracy — pinpointing the exact turn is the harder half of the task.
  - Validate your pipeline on the Who&When ground-truth labels before trusting it on your own unlabeled failures.
  - Keep traces complete: truncated logs deprive the attribution LLM of the context it needs to assign blame.
