Agents

Graham Neubig

Carnegie Mellon University Language Technologies Institute

Agent Definition

  • A system that iteratively uses tools to achieve a task
  • Almost all agents are powered by LLMs
  • They pose unique inference challenges in
    • Reasoning and planning
    • Environment representation
    • Long-context modeling
    • Evaluation
    • Critic models
    • Multi-agent delegation

Agent Use Cases

  • Software Development: Code generation, debugging, testing
    • Example: Claude Code, Codex, OpenHands
  • Web Automation: Data extraction, form filling, testing
    • Example: OpenAI agent mode, Browser-use
  • Research & Analysis: Information gathering, report generation
    • Example: Perplexity Pro, OpenDeepResearch
  • Interactive Environments: Games, simulations, robotics
    • Example: Minecraft agents, robotic control

My Personal Experience: OpenHands

A software development agent w/ web browsing

OpenHands screenshot

Reasoning and Planning

A Basic Agentic Loop: React

ReAct framework
  • Key Idea: Interleave reasoning (thought) and acting (action) in language models
  • Method: Generate thoughts to track progress, plan actions, and handle exceptions
  • Benefits: Better interpretability, human-like problem decomposition, reduced hallucination

Agents with Coding: CodeAct

CodeAct framework
  • Key Idea: Use executable code as the unified interface for LLM agents
  • Method: Replace complex action spaces with Python code execution
  • Benefits: Simplifies action space, improves tool use

How to Plan with Agents: OpenHands Planning Tool

Two-Tool Planning System:

1. Task Tracker Tool - Structured task management

{
  "command": "plan",
  "task_list": [
    {"id": "1", "title": "Analyze requirements", "status": "done"},
    {"id": "2", "title": "Implement feature", "status": "in_progress"},
    {"id": "3", "title": "Write tests", "status": "todo"}
  ]
}

2. Think Tool - Complex reasoning and brainstorming

{
  "thought": "I need to consider three approaches: direct implementation, 
   refactoring existing code, or creating a new module. The refactoring 
   approach seems most maintainable..."
}

Environment Representation

Environment Representation - Text Based

e.g. from Tavily MCP extract function

# Language Technologies Institute
## School of Computer Science - Carnegie Mellon University

### Mission Statement
**Empowering human communication through trustworthy language technologies**

The Language Technologies Institute at Carnegie Mellon educates the leaders of tomorrow and performs groundbreaking research in the areas of:

### Research Areas

- **Natural Language Processing**
- **Computational Linguistics** 
- **Information Extraction**

Environment Representation - Accessibility Tree

e.g. BrowserGym axtree

RootWebArea 'Language Technologies Institute - Language Technologies Institute - School of Computer Science - Carnegie Mellon University', focused
	[40] banner '', center="(632,127)", visible
		[43] link 'Carnegie Mellon University', center="(177,21)", clickable, visible
		[44] button 'toggle menu', center="(1221,18)", clickable, visible, pressed='false'
			StaticText '—'
			StaticText '—'
			StaticText '—'
		[48] Section '', center="(1047,19)", visible
			[52] LabelText '', center="(1047,19)", clickable, visible
				[54] textbox 'Search', center="(1047,19)", clickable, visible
			[55] button 'Search', center="(1183,19)", clickable, visible
		[62] link 'Language Technologies Institute', center="(369,107)", clickable, visible
		[63] heading 'School of Computer Science', center="(430,162)", visible
			[64] link 'School of Computer Science', center="(174,162)", clickable, visible
		[65] link 'LTI Logo', center="(1049,126)", clickable, visible
			[66] image 'LTI Logo', center="(1161,130)", visible

Environment Representation - Visual

e.g. VisualWebArena

Agent reasoning trajectory

Efficiency Optimizations

Context Length Challenges

  • Long conversation histories
  • Large environment descriptions
  • Multiple tool call results
  • Solution: Need efficient context management strategies

Prompt Caching

  • Problem: Agent workflows repeat system prompts and context across turns
  • Solution: Cache and reuse prefix

Context Condensation

OpenHands approach:

Context condensation overview
  • Challenge: Context grows quadratically with conversation length
  • Solution: Intelligent summarization while preserving key information
  • Results: 2x cost reduction, equivalent performance on SWE-bench

Web Context Condensation

  • Challenge: Web pages are large and noisy
  • Solution: Keep only the most recent axtree or image
  • Challenge: Prompt caching is less effective

Evaluation

SWE-Bench

SWE-Bench overview
  • Purpose: Evaluate language models on real-world software engineering tasks
  • Dataset: 2,294 GitHub issues from 12 popular Python repositories
  • Task: Generate patches to resolve actual software bugs and feature requests

WebArena

WebArena overview
  • Purpose: Realistic web environment for building autonomous agents
  • Environment: Fully functional websites from 4 domains (e-commerce, forums, dev, CMS)
  • Tasks: Complex, multi-step web interactions with functional correctness evaluation

GAIA

GAIA overview
  • Purpose: Benchmark for General AI Assistants on real-world questions
  • Philosophy: Tasks simple for humans but challenging for AI systems
  • Requirements: Reasoning, multi-modality, web browsing, tool use proficiency

Critic Models

Critic Models for Agents

  • Motivation: It can help to have a “second opinion”
  • Use Case 1: Rerank multiple candidates to increase accuracy
  • Use Case 2: Audit to ensure safety

Reranking: SWE-Gym

  • Train a critic model to rerank multiple candidate patches
  • Roll out N times and pick the best one
SWE-Gym reranking overview

Multi-agent Systems

Why use multiple agents?

  • Specialization: Different agents for different tasks
  • Scalability: Parallel processing of subtasks
  • Robustness: Redundancy and error checking

Why not use multiple agents?

  • Complexity: More components to manage and integrate
  • Task Decomposition: Hard to break down tasks effectively
  • Communication Difficulties: Ensuring exactly appropriate context is passed

Example: AutoGen

AutoGen framework
  • Framework: Multi-agent conversation system for LLM applications
  • Key Features: Customizable agents, flexible conversation patterns, human-in-the-loop
  • Applications: Mathematics, coding, question answering, decision-making
  • Benefits: Enables complex workflows through agent collaboration

Summary

  • Agents are powerful tools for complex tasks
  • Key challenges include reasoning, environment representation, context management, and evaluation
  • Multi-agent systems and critic models enhance capabilities and reliability