Agents

Graham Neubig

Agent Definition

A system that iteratively uses tools to achieve a task
Almost all agents are powered by LLMs
They pose unique inference challenges in
- Reasoning and planning
- Environment representation
- Long-context modeling
- Evaluation
- Critic models
- Multi-agent delegation

Agent Use Cases

Software Development: Code generation, debugging, testing
- Example: Claude Code, Codex, OpenHands
Web Automation: Data extraction, form filling, testing
- Example: OpenAI agent mode, Browser-use
Research & Analysis: Information gathering, report generation
- Example: Perplexity Pro, OpenDeepResearch
Interactive Environments: Games, simulations, robotics
- Example: Minecraft agents, robotic control

My Personal Experience: OpenHands

A software development agent w/ web browsing

Software: https://github.com/All-Hands-AI/OpenHands
Demo: https://app.all-hands.dev/

Reasoning and Planning

A Basic Agentic Loop: React

Key Idea: Interleave reasoning (thought) and acting (action) in language models
Method: Generate thoughts to track progress, plan actions, and handle exceptions
Benefits: Better interpretability, human-like problem decomposition, reduced hallucination

Agents with Coding: CodeAct

Key Idea: Use executable code as the unified interface for LLM agents
Method: Replace complex action spaces with Python code execution
Benefits: Simplifies action space, improves tool use

How to Plan with Agents: OpenHands Planning Tool

Two-Tool Planning System:

1. Task Tracker Tool - Structured task management

{
  "command": "plan",
  "task_list": [
    {"id": "1", "title": "Analyze requirements", "status": "done"},
    {"id": "2", "title": "Implement feature", "status": "in_progress"},
    {"id": "3", "title": "Write tests", "status": "todo"}
  ]
}

2. Think Tool - Complex reasoning and brainstorming

{
  "thought": "I need to consider three approaches: direct implementation, 
   refactoring existing code, or creating a new module. The refactoring 
   approach seems most maintainable..."
}

Environment Representation

Environment Representation - Text Based

e.g. from Tavily MCP extract function

# Language Technologies Institute
## School of Computer Science - Carnegie Mellon University

### Mission Statement
**Empowering human communication through trustworthy language technologies**

The Language Technologies Institute at Carnegie Mellon educates the leaders of tomorrow and performs groundbreaking research in the areas of:

### Research Areas

- **Natural Language Processing**
- **Computational Linguistics** 
- **Information Extraction**

Environment Representation - Accessibility Tree

e.g. BrowserGym axtree

RootWebArea 'Language Technologies Institute - Language Technologies Institute - School of Computer Science - Carnegie Mellon University', focused
	[40] banner '', center="(632,127)", visible
		[43] link 'Carnegie Mellon University', center="(177,21)", clickable, visible
		[44] button 'toggle menu', center="(1221,18)", clickable, visible, pressed='false'
			StaticText '—'
			StaticText '—'
			StaticText '—'
		[48] Section '', center="(1047,19)", visible
			[52] LabelText '', center="(1047,19)", clickable, visible
				[54] textbox 'Search', center="(1047,19)", clickable, visible
			[55] button 'Search', center="(1183,19)", clickable, visible
		[62] link 'Language Technologies Institute', center="(369,107)", clickable, visible
		[63] heading 'School of Computer Science', center="(430,162)", visible
			[64] link 'School of Computer Science', center="(174,162)", clickable, visible
		[65] link 'LTI Logo', center="(1049,126)", clickable, visible
			[66] image 'LTI Logo', center="(1161,130)", visible

Environment Representation - Visual

e.g. VisualWebArena

Efficiency Optimizations

Context Length Challenges

Long conversation histories
Large environment descriptions
Multiple tool call results
Solution: Need efficient context management strategies

Prompt Caching

Problem: Agent workflows repeat system prompts and context across turns
Solution: Cache and reuse prefix

Context Condensation

OpenHands approach:

Challenge: Context grows quadratically with conversation length
Solution: Intelligent summarization while preserving key information
Results: 2x cost reduction, equivalent performance on SWE-bench

Web Context Condensation

Challenge: Web pages are large and noisy
Solution: Keep only the most recent axtree or image
Challenge: Prompt caching is less effective

Evaluation

SWE-Bench

Purpose: Evaluate language models on real-world software engineering tasks
Dataset: 2,294 GitHub issues from 12 popular Python repositories
Task: Generate patches to resolve actual software bugs and feature requests

WebArena

Purpose: Realistic web environment for building autonomous agents
Environment: Fully functional websites from 4 domains (e-commerce, forums, dev, CMS)
Tasks: Complex, multi-step web interactions with functional correctness evaluation

GAIA

Purpose: Benchmark for General AI Assistants on real-world questions
Philosophy: Tasks simple for humans but challenging for AI systems
Requirements: Reasoning, multi-modality, web browsing, tool use proficiency

Critic Models

Critic Models for Agents

Motivation: It can help to have a “second opinion”
Use Case 1: Rerank multiple candidates to increase accuracy
Use Case 2: Audit to ensure safety

Reranking: SWE-Gym

Train a critic model to rerank multiple candidate patches
Roll out N times and pick the best one

Multi-agent Systems

Why use multiple agents?

Specialization: Different agents for different tasks
Scalability: Parallel processing of subtasks
Robustness: Redundancy and error checking

Why not use multiple agents?

Complexity: More components to manage and integrate
Task Decomposition: Hard to break down tasks effectively
Communication Difficulties: Ensuring exactly appropriate context is passed

Example: AutoGen

Framework: Multi-agent conversation system for LLM applications
Key Features: Customizable agents, flexible conversation patterns, human-in-the-loop
Applications: Mathematics, coding, question answering, decision-making
Benefits: Enables complex workflows through agent collaboration

Summary

Agents are powerful tools for complex tasks
Key challenges include reasoning, environment representation, context management, and evaluation
Multi-agent systems and critic models enhance capabilities and reliability