How to Automate Agent Trajectory Analysis with GitHub Copilot


Introduction

When your job involves evaluating coding agents, you quickly realize that sifting through hundreds of thousands of lines of trajectory data is unsustainable. Each run produces JSON files packed with agent thoughts and actions, and analyzing these manually is a recipe for burnout. But what if you could automate the repetitive part of that analysis—letting you focus on the insights that matter? That's exactly what I did using GitHub Copilot, and in this guide, I'll show you how to build your own automated agent-driven development loop. You'll learn to set up, script, and deploy tools that turn tedious data review into a streamlined, collaborative process.

How to Automate Agent Trajectory Analysis with GitHub Copilot — Source: github.blog

What You Need

GitHub Copilot (Individual, Business, or Enterprise license) with access to agent functionality
A set of evaluation benchmarks, such as TerminalBench2 or SWE-bench-Pro, with trajectory data in JSON format
Basic knowledge of Python and JSON
Familiarity with the command line and version control (Git)
A code editor (e.g., VS Code) with the Copilot extension installed
(Optional) Access to a shared repository for team collaboration

Step-by-Step Guide

Step 1: Understand Your Data

Before automating, you must know what you're working with. Trajectory files typically contain a list of agent steps—each step includes the agent's thought process, the action taken, and the result. Open one such JSON file and familiarize yourself with its structure using Copilot:

In VS Code, ask Copilot Chat: “Summarize the structure of this JSON trajectory file.”
Use inline suggestions to write a quick Python script that extracts key fields (e.g., action types, token counts, success status).
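As a rough illustration of that exploration script, here is a minimal sketch. The trajectory layout it reads (task_id, success, and a steps list whose entries carry thought, action.type, result, and tokens) is a hypothetical schema invented for this example; real benchmark output will likely use different field names, so adjust the keys to match what you find in your own files.

```python
import json
from collections import Counter

# Hypothetical trajectory layout, embedded as a string so the sketch runs
# on its own. Real benchmark files will likely use different field names.
SAMPLE = """
{
  "task_id": "demo-001",
  "success": false,
  "steps": [
    {"thought": "List files", "action": {"type": "shell"}, "result": "ok", "tokens": 120},
    {"thought": "Edit file", "action": {"type": "edit"}, "result": "ok", "tokens": 340},
    {"thought": "Run tests", "action": {"type": "shell"}, "result": "error", "tokens": 95}
  ]
}
"""


def summarize_trajectory(trajectory: dict) -> dict:
    """Pull a few key fields out of one trajectory dict."""
    steps = trajectory.get("steps", [])
    # Count how often each action type appears; missing types become "unknown".
    action_counts = Counter(
        step.get("action", {}).get("type", "unknown") for step in steps
    )
    return {
        "task_id": trajectory.get("task_id"),
        "success": bool(trajectory.get("success", False)),
        "num_steps": len(steps),
        "total_tokens": sum(step.get("tokens", 0) for step in steps),
        "action_counts": dict(action_counts),
    }


summary = summarize_trajectory(json.loads(SAMPLE))
print(summary)
```

Using .get() with defaults keeps the script from crashing on steps that lack a field, which is useful while you are still learning what the data actually contains.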

This exploration phase helps you identify the patterns you'll later automate.

Step 2: Define Your Analysis Questions

What are you looking for? Common questions include:

How often does the agent fail on specific task types?
What is the average length of a successful trajectory vs. a failed one?
Which actions appear most frequently when the agent gets stuck?

Write these down. They become the core of your automated analysis scripts. Use Copilot to generate a list of potential metrics based on your data fields.
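To make the three example questions concrete, each one can be written as a small metric function. The sketch below assumes every run has already been reduced to a record with task_type, success, and actions fields; those names and the sample records are hypothetical. It also treats "stuck" as simply "the run failed", which is a simplifying assumption you may want to replace with your own definition.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical, hand-written records; the field names are assumptions.
RECORDS = [
    {"task_type": "bugfix", "success": True, "actions": ["shell", "edit", "shell"]},
    {"task_type": "bugfix", "success": False,
     "actions": ["shell", "shell", "shell", "edit", "shell"]},
    {"task_type": "refactor", "success": True, "actions": ["edit", "shell"]},
    {"task_type": "refactor", "success": False,
     "actions": ["shell", "shell", "shell", "shell"]},
]


def failure_rate_by_task_type(records):
    """Fraction of failed runs for each task type."""
    totals, failures = Counter(), Counter()
    for record in records:
        totals[record["task_type"]] += 1
        if not record["success"]:
            failures[record["task_type"]] += 1
    return {task_type: failures[task_type] / totals[task_type] for task_type in totals}


def mean_length_by_outcome(records):
    """Average number of actions in successful vs. failed runs."""
    lengths = defaultdict(list)
    for record in records:
        outcome = "success" if record["success"] else "failure"
        lengths[outcome].append(len(record["actions"]))
    return {outcome: mean(values) for outcome, values in lengths.items()}


def most_common_actions_in_failures(records, top_n=3):
    """Most frequent action types across failed runs (a stand-in for 'stuck')."""
    counts = Counter()
    for record in records:
        if not record["success"]:
            counts.update(record["actions"])
    return counts.most_common(top_n)


print(failure_rate_by_task_type(RECORDS))
print(mean_length_by_outcome(RECORDS))
print(most_common_actions_in_failures(RECORDS))
```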

Step 3: Create a Modular Script for Analysis

Instead of writing one monolithic script, break your analysis into small, reusable functions. For example:

load_trajectories(folder_path) – Loads all JSON files from a given directory.
extract_actions(trajectory) – Returns a list of action types from a single trajectory.
compute_success_rate(trajectories) – Calculates the percentage of successful tasks.

Copilot excels at generating these functions. Start with a blank Python file, type a descriptive comment (e.g., # function to count the number of times the agent retries), and let Copilot suggest the implementation. Review and adjust as needed.
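One possible shape for the three functions named above is sketched here. The function names come from the list in this step; the bodies, and the assumption that each file holds a single JSON object with a boolean success field and a steps list of action.type entries, are illustrative rather than a reference implementation.

```python
import json
from pathlib import Path


def load_trajectories(folder_path):
    """Load every *.json file in folder_path into a list of dicts."""
    trajectories = []
    # sorted() gives a stable order, which makes reports reproducible.
    for path in sorted(Path(folder_path).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            trajectories.append(json.load(handle))
    return trajectories


def extract_actions(trajectory):
    """Return the list of action types from a single trajectory.

    Assumes a hypothetical layout: trajectory["steps"][i]["action"]["type"].
    """
    return [
        step.get("action", {}).get("type", "unknown")
        for step in trajectory.get("steps", [])
    ]


def compute_success_rate(trajectories):
    """Percentage (0-100) of trajectories whose 'success' field is truthy."""
    if not trajectories:
        return 0.0  # avoid dividing by zero on an empty batch
    successes = sum(1 for trajectory in trajectories if trajectory.get("success"))
    return 100.0 * successes / len(trajectories)
```

Keeping each function free of printing and file writing makes it easy to test in isolation and to reuse from a notebook, a script, or a command-line tool.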

Step 4: Build an Automated Analysis Loop

Now it's time to automate the repetitive cycle you used to do manually. Create a main script that:

Reads the latest batch of trajectory files from a benchmark run.
Runs your analysis functions.
Outputs a summary report (e.g., as a markdown file or console printout).

Use Copilot to help you write file-watching logic or set up a cron job. For instance, ask: “Write a Python script that monitors a folder for new JSON files and runs analysis whenever a file is added.” Copilot will generate a skeleton you can customize.
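For reference, a folder-monitoring skeleton might look something like the following. This sketch uses plain polling with the standard library rather than a dedicated file-watching package, and the function names (find_new_files, summarize_batch, watch) and the success and steps fields are assumptions made for illustration. Polling can also pick up a file that is still being written, so a production version would need some guard against partial files.

```python
import json
import time
from pathlib import Path


def find_new_files(folder, seen):
    """Return JSON files in folder that are not in seen, and record them."""
    new_files = [p for p in sorted(Path(folder).glob("*.json")) if p not in seen]
    seen.update(new_files)
    return new_files


def summarize_batch(paths):
    """Build a short markdown report for a batch of trajectory files."""
    lines = ["# Trajectory summary", ""]
    for path in paths:
        with path.open(encoding="utf-8") as handle:
            trajectory = json.load(handle)
        status = "success" if trajectory.get("success") else "failure"
        num_steps = len(trajectory.get("steps", []))
        lines.append(f"- {path.name}: {num_steps} steps, {status}")
    return "\n".join(lines)


def watch(folder, interval_seconds=5.0, max_polls=None):
    """Poll folder and print a report whenever new JSON files appear.

    max_polls=None runs until interrupted; a number stops after that many polls.
    """
    seen = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        new_files = find_new_files(folder, seen)
        if new_files:
            print(summarize_batch(new_files))
        polls += 1
        # Skip the final sleep when a poll limit has been reached.
        if max_polls is None or polls < max_polls:
            time.sleep(interval_seconds)
```

Calling watch("path/to/trajectories") would poll every five seconds until you stop it with Ctrl+C; passing max_polls=1 turns the same code into a one-shot run that suits a cron job.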

Step 5: Turn the Script into a Configurable Agent

To make this tool shareable and extensible, package it as a command-line agent. Use Python's built-in argparse module to accept parameters like:

--benchmark-name – Which benchmark dataset to analyze
--output-format – e.g., JSON, CSV, or HTML
--filters – Conditions to focus on specific trajectory types

Copilot can scaffold this structure for you. Type a comment like # CLI entry point for eval-agents tool and start accepting suggestions. Also consider adding a README.md—Copilot can generate an initial draft based on your code comments.
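A scaffold for those three flags could look like the sketch below. The flag names and the eval-agents program name come from this step; the KEY=VALUE syntax for --filters, the lowercase format choices, and the helper names are assumptions chosen for the example, and main() only echoes its arguments instead of running a real analysis.

```python
import argparse


def build_parser():
    """Create the argument parser for the hypothetical eval-agents CLI."""
    parser = argparse.ArgumentParser(
        prog="eval-agents",
        description="Analyze agent trajectory files from a benchmark run.",
    )
    parser.add_argument(
        "--benchmark-name", required=True,
        help="Which benchmark dataset to analyze.",
    )
    parser.add_argument(
        "--output-format", choices=["json", "csv", "html"], default="json",
        help="Format of the summary report.",
    )
    parser.add_argument(
        "--filters", nargs="*", default=[], metavar="KEY=VALUE",
        help="Conditions that select specific trajectories (assumed syntax).",
    )
    return parser


def parse_filters(pairs):
    """Turn ['success=false', ...] into {'success': 'false', ...}."""
    filters = {}
    for pair in pairs:
        key, separator, value = pair.partition("=")
        if not separator or not key:
            raise ValueError(f"Expected KEY=VALUE, got {pair!r}")
        filters[key] = value
    return filters


def main(argv=None):
    """CLI entry point; argv=None makes argparse read sys.argv."""
    args = build_parser().parse_args(argv)
    filters = parse_filters(args.filters)
    # Placeholder: a real tool would load trajectories and run metrics here.
    print(f"benchmark={args.benchmark_name} "
          f"format={args.output_format} filters={filters}")
    return 0
```

Accepting an explicit argv list in main() makes the entry point easy to exercise from tests, for example main(["--benchmark-name", "demo", "--filters", "success=false"]).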

Step 6: Enable Collaboration via GitHub

Your agent is most powerful when your team can use and improve it. Push your repository to GitHub and set up a simple contribution workflow:

Write clear issue templates for new analysis features.
Use GitHub Actions to run tests on pull requests.
Add inline documentation generated by Copilot for each function.

Encourage team members to open PRs with new analysis modules. Because you designed the agent with modularity in mind, adding a new metric takes only a few lines of code—and Copilot can help them write it too.
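One way to keep new metrics down to a few lines is a small registry that contributors add to with a decorator. This is a sketch of one possible design, not a description of any existing tool: the METRICS dictionary, the metric decorator, and the success and steps fields are all hypothetical names.

```python
# Hypothetical registry mapping metric names to functions.
METRICS = {}


def metric(name):
    """Decorator that registers an analysis function under a given name."""
    def register(func):
        METRICS[name] = func
        return func
    return register


@metric("success_rate")
def success_rate(trajectories):
    """Percentage of trajectories marked successful."""
    if not trajectories:
        return 0.0
    return 100.0 * sum(1 for t in trajectories if t.get("success")) / len(trajectories)


# A contributor's pull request only needs to add a block like this one.
@metric("mean_steps")
def mean_steps(trajectories):
    """Average number of steps per trajectory."""
    if not trajectories:
        return 0.0
    return sum(len(t.get("steps", [])) for t in trajectories) / len(trajectories)


def run_all_metrics(trajectories):
    """Run every registered metric and collect the results by name."""
    return {name: func(trajectories) for name, func in METRICS.items()}
```

Because each metric is a plain function of a list of trajectories, the tests that run on pull requests can call it directly with a couple of hand-written records.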

Tips for Success

Start small: Automate one repetitive question first, then expand. This reduces initial friction and builds momentum.
Leverage Copilot Chat: When stuck on a complex function, describe your goal in natural language and let Copilot propose solutions. You'll often get code you can adapt.
Design for sharing: Use standard data formats and include a setup.py or requirements.txt so others can install your agent and its dependencies with a single command (for example, pip install -r requirements.txt).
Keep humans in the loop: Your agent should produce summaries, not final decisions. Always review the automated output for edge cases.
Iterate with feedback: Ask teammates what analysis they find most valuable. Their needs will guide your next development sprint.
Document as you code: Use Copilot to generate docstrings and README sections while the logic is fresh in your mind.

By following these steps, you'll transform tedious manual data sifting into an automated, collaborative process. You'll not only save hours each week but also enable your entire team to contribute to the analysis—turning one person's intellectual toil into a shared productivity boost.
