Agent Skill
2/7/2026

testing-ai-agents

Use when testing AI agent code with pytest. Covers TDD for agent APIs, mocking LLM calls (NOT evaluating LLM outputs), pytest-asyncio patterns, FastAPI testing with httpx, SQLModel testing, and agent tool testing. NOT for evaluating LLM reasoning quality (use evals skill).

panaversity
npx skills add panaversity/agentfactory

SKILL.md

Testing AI Agents: TDD for Agent Code

Test-Driven Development for agent applications. This skill covers testing code correctness, which is deterministic: a test either passes or fails. It does NOT cover measuring LLM reasoning quality, which is probabilistic and reported as scores; use evals for that.

Critical Distinction: TDD vs Evals

| Aspect | TDD (This Skill) | Evals (Chapter 47) |
|---|---|---|
| Question | Does the code work correctly? | Does the LLM reason well? |
| Nature | Deterministic | Probabilistic |
| Output | Pass/Fail | Scores (0-1) |
| Tests | Functions, APIs, DB operations | Response quality, faithfulness |
| Speed | Fast (mocked LLM) | Slow (real LLM calls) |
| Cost | Zero (no API calls) | High (API calls required) |
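The distinction in the table can be made concrete with a small sketch. Both functions below are hypothetical illustrations, not part of this skill's codebase: `build_tool_result` stands in for deterministic agent code, and `keyword_overlap_score` is a deliberately simple toy metric standing in for an eval.

```python
import json


def build_tool_result(name: str, payload: dict) -> str:
    """Deterministic code under test: serialize a tool result for the LLM."""
    return json.dumps({"tool": name, "result": payload}, sort_keys=True)


def keyword_overlap_score(answer: str, expected_keywords: list[str]) -> float:
    """Toy eval metric: fraction of expected keywords found in the answer."""
    hits = sum(1 for kw in expected_keywords if kw.lower() in answer.lower())
    return hits / len(expected_keywords)


# TDD-style check: exact, repeatable, pass/fail.
assert build_tool_result("get_weather", {"city": "London"}) == (
    '{"result": {"city": "London"}, "tool": "get_weather"}'
)

# Eval-style check: a score between 0 and 1, usually compared to a threshold.
score = keyword_overlap_score("London is rainy today", ["London", "rain", "umbrella"])
```

The first check belongs in a pytest suite; the second kind of measurement belongs in an eval run against real model output.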

Quick Start: Project Setup

# Install testing dependencies (aiosqlite backs the sqlite+aiosqlite test database)
uv add --dev pytest pytest-asyncio httpx respx pytest-cov aiosqlite

# Configure pytest (append, so the existing pyproject.toml content is preserved)
cat >> pyproject.toml << 'EOF'

[tool.pytest.ini_options]
asyncio_mode = "auto"
asyncio_default_fixture_loop_scope = "function"
testpaths = ["tests"]
EOF

Core Testing Patterns

Pattern 1: Async Test Setup

# tests/conftest.py
import os
import pytest
from httpx import ASGITransport, AsyncClient
from sqlalchemy.ext.asyncio import create_async_engine, async_sessionmaker
from sqlalchemy.pool import StaticPool
from sqlmodel import SQLModel
from sqlmodel.ext.asyncio.session import AsyncSession

# Set environment FIRST
os.environ.setdefault("DATABASE_URL", "sqlite+aiosqlite:///:memory:")
os.environ.setdefault("OPENAI_API_KEY", "test-key-not-used")

from app.main import app
from app.database import get_session
from app.auth import get_current_user

# Test database
TEST_DATABASE_URL = "sqlite+aiosqlite:///:memory:"

test_engine = create_async_engine(
    TEST_DATABASE_URL,
    echo=False,
    poolclass=StaticPool,
    connect_args={"check_same_thread": False},
)

TestAsyncSession = async_sessionmaker(
    test_engine,
    class_=AsyncSession,
    expire_on_commit=False,
)

# Mock user
TEST_USER = {"sub": "test-user-123", "email": "test@example.com"}

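# NOTE: overriding event_loop is deprecated in recent pytest-asyncio releases,
# and the fixture was removed in pytest-asyncio 1.0. On those versions, delete
# this fixture and rely on asyncio_default_fixture_loop_scope instead.
@pytest.fixture(scope="session")
def event_loop():
    """Create event loop for session-scoped fixtures."""
    import asyncio
    loop = asyncio.get_event_loop_policy().new_event_loop()
    yield loop
    loop.close()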

@pytest.fixture(autouse=True)
async def setup_database():
    """Create tables before each test, drop after."""
    async with test_engine.begin() as conn:
        await conn.run_sync(SQLModel.metadata.create_all)
    yield
    async with test_engine.begin() as conn:
        await conn.run_sync(SQLModel.metadata.drop_all)

async def get_test_session():
    async with TestAsyncSession() as session:
        yield session

def get_test_user():
    return TEST_USER

@pytest.fixture
async def client():
    """Async test client with mocked dependencies."""
    app.dependency_overrides[get_session] = get_test_session
    app.dependency_overrides[get_current_user] = get_test_user

    async with AsyncClient(
        transport=ASGITransport(app=app),
        base_url="http://test",
    ) as ac:
        yield ac

    app.dependency_overrides.clear()
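The conftest above pairs an in-memory SQLite URL with `StaticPool`. A likely reason is that every new connection to `:memory:` opens its own empty database, so tables created by `setup_database` would be invisible to a session on a different connection. The sketch below shows that behavior with the standard-library `sqlite3` module only; the `task` table is a stand-in, not the app's real schema.

```python
import sqlite3

# Two separate connections to ":memory:" are two separate databases.
first = sqlite3.connect(":memory:")
second = sqlite3.connect(":memory:")

first.execute("CREATE TABLE task (id INTEGER PRIMARY KEY, title TEXT)")
first.execute("INSERT INTO task (title) VALUES ('Test Task')")
first.commit()


def table_names(conn: sqlite3.Connection) -> list[str]:
    """List the user tables visible on this connection."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return [name for (name,) in rows]


tables_in_first = table_names(first)    # the table exists on this connection
tables_in_second = table_names(second)  # this database is still empty

first.close()
second.close()
```

`StaticPool` keeps a single connection for the whole engine, so fixtures and request handlers all see the same in-memory database.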

Pattern 2: Testing FastAPI Endpoints

# tests/test_tasks.py
import pytest
from httpx import AsyncClient

@pytest.mark.asyncio
async def test_create_task(client: AsyncClient):
    """Test creating a task via API."""
    response = await client.post(
        "/api/tasks",
        json={"title": "Test Task", "priority": "high"},
    )
    assert response.status_code == 201
    data = response.json()
    assert data["title"] == "Test Task"
    assert data["priority"] == "high"
    assert data["status"] == "pending"

@pytest.mark.asyncio
async def test_get_task_not_found(client: AsyncClient):
    """Test 404 for non-existent task."""
    response = await client.get("/api/tasks/99999")
    assert response.status_code == 404

@pytest.mark.asyncio
async def test_list_tasks_with_filter(client: AsyncClient):
    """Test filtering tasks by status."""
    # Create test data
    await client.post("/api/tasks", json={"title": "Task 1"})
    await client.post("/api/tasks", json={"title": "Task 2"})

    # Filter by status
    response = await client.get("/api/tasks", params={"status": "pending"})
    assert response.status_code == 200
    data = response.json()
    assert len(data) == 2

Pattern 3: Testing SQLModel Operations

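# tests/test_models.py
import pytest
from sqlmodel.ext.asyncio.session import AsyncSession
from app.models import Task, Project
# Session factory defined in tests/conftest.py; adjust the import path to your layout.
from conftest import TestAsyncSession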

@pytest.fixture
async def session():
    """Direct database session for model testing."""
    async with TestAsyncSession() as session:
        yield session

@pytest.mark.asyncio
async def test_create_task(session: AsyncSession):
    """Test Task model creation."""
    task = Task(title="Test", priority="high")
    session.add(task)
    await session.commit()
    await session.refresh(task)

    assert task.id is not None
    assert task.created_at is not None

@pytest.mark.asyncio
async def test_cascade_delete(session: AsyncSession):
    """Test parent-child cascade deletion."""
    project = Project(name="Test Project")
    session.add(project)
    await session.commit()

    task = Task(title="Test", project_id=project.id)
    session.add(task)
    await session.commit()

    # Delete parent
    await session.delete(project)
    await session.commit()

    # Verify child deleted
    result = await session.get(Task, task.id)
    assert result is None
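Whether `test_cascade_delete` passes depends on how the cascade is declared, which the models (not shown here) decide. If it relies on a database-level `ON DELETE CASCADE` rather than an ORM relationship cascade, note that SQLite ignores foreign keys unless `PRAGMA foreign_keys = ON` is set on the connection. A minimal sketch with plain `sqlite3` and a stand-in schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # foreign keys are off by default in SQLite
conn.execute("CREATE TABLE project (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE task ("
    " id INTEGER PRIMARY KEY,"
    " title TEXT,"
    " project_id INTEGER REFERENCES project(id) ON DELETE CASCADE)"
)
conn.execute("INSERT INTO project (id, name) VALUES (1, 'Test Project')")
conn.execute("INSERT INTO task (title, project_id) VALUES ('Test', 1)")

# Deleting the parent removes the child row because the pragma is enabled.
conn.execute("DELETE FROM project WHERE id = 1")
remaining_tasks = conn.execute("SELECT COUNT(*) FROM task").fetchone()[0]
conn.close()
```

With SQLAlchemy, the same pragma is typically issued from a connection event listener on the test engine.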

Pattern 4: Mocking LLM Calls with respx

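# tests/test_agent_tools.py
import pytest
import respx
import httpx
# Assumption: app.agent exposes these names; adjust the import to your project.
from app.agent import RateLimitError, agent, call_openai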

@pytest.mark.asyncio
@respx.mock
async def test_openai_completion():
    """Mock OpenAI API response."""
    # Mock the API endpoint
    respx.post("https://api.openai.com/v1/chat/completions").mock(
        return_value=httpx.Response(
            200,
            json={
                "choices": [{
                    "message": {
                        "role": "assistant",
                        "content": "Hello, I can help with that!"
                    }
                }],
                "usage": {"total_tokens": 50}
            }
        )
    )

    # Call your function
    result = await call_openai("Say hello")

    assert "Hello" in result
    assert respx.calls.call_count == 1

@pytest.mark.asyncio
@respx.mock
async def test_openai_rate_limit():
    """Test rate limit handling."""
    respx.post("https://api.openai.com/v1/chat/completions").mock(
        return_value=httpx.Response(429, json={"error": "Rate limited"})
    )

    with pytest.raises(RateLimitError):
        await call_openai("Test")

@pytest.mark.asyncio
@respx.mock
async def test_tool_call_parsing():
    """Test agent parses tool calls correctly."""
    respx.post("https://api.openai.com/v1/chat/completions").mock(
        return_value=httpx.Response(
            200,
            json={
                "choices": [{
                    "message": {
                        "role": "assistant",
                        "tool_calls": [{
                            "id": "call_123",
                            "function": {
                                "name": "get_weather",
                                "arguments": '{"city": "London"}'
                            }
                        }]
                    }
                }]
            }
        )
    )

    result = await agent.process("What's the weather in London?")

    assert result.tool_calls[0].function.name == "get_weather"
    assert result.tool_calls[0].function.arguments["city"] == "London"
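The last assertion above indexes `arguments["city"]`, while the mocked payload delivers `arguments` as a JSON string. The agent therefore has to decode it somewhere. The sketch below shows one way that parsing step could look; `ParsedToolCall` and `parse_tool_calls` are hypothetical names, not the app's actual API.

```python
import json
from dataclasses import dataclass


@dataclass
class ParsedToolCall:
    """Hypothetical container for one decoded tool call."""
    id: str
    name: str
    arguments: dict


def parse_tool_calls(completion: dict) -> list[ParsedToolCall]:
    """Extract tool calls from a chat-completions style payload."""
    message = completion["choices"][0]["message"]
    calls = []
    for raw in message.get("tool_calls") or []:
        function = raw["function"]
        calls.append(
            ParsedToolCall(
                id=raw["id"],
                name=function["name"],
                # The payload carries arguments as a JSON string, not a dict.
                arguments=json.loads(function["arguments"]),
            )
        )
    return calls


payload = {
    "choices": [{
        "message": {
            "role": "assistant",
            "tool_calls": [{
                "id": "call_123",
                "function": {"name": "get_weather", "arguments": '{"city": "London"}'},
            }],
        }
    }]
}
parsed = parse_tool_calls(payload)
```

A pure function like this can be unit tested without any HTTP mocking at all, which keeps the respx-based tests focused on transport behavior.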

Pattern 5: Using pytest-mockllm

# tests/test_with_mockllm.py
import pytest

def test_anthropic_mock(mock_anthropic):
    """Test with pytest-mockllm for Anthropic."""
    mock_anthropic.add_response("I can help with that task!")

    from anthropic import Anthropic
    client = Anthropic(api_key="fake")

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": "Hello!"}]
    )

    assert "help" in response.content[0].text

def test_openai_mock(mock_openai):
    """Test with pytest-mockllm for OpenAI."""
    mock_openai.add_response("Task completed successfully.")

    from openai import OpenAI
    client = OpenAI(api_key="fake")

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Complete this task"}]
    )

    assert "completed" in response.choices[0].message.content
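If the plugin is not installed, a different technique gives a similar result: pass the LLM call into the code under test and replace it with `unittest.mock.AsyncMock`. This is not pytest-mockllm; it is a standard-library alternative, and `summarize` is a hypothetical helper used only for illustration.

```python
import asyncio
from unittest.mock import AsyncMock


async def summarize(llm_call, text: str) -> str:
    """Hypothetical agent helper that delegates to an injected async LLM callable."""
    reply = await llm_call(f"Summarize: {text}")
    return reply.strip()


# The fake returns a canned reply and records how it was awaited.
fake_llm = AsyncMock(return_value="  Task completed successfully.  ")
summary = asyncio.run(summarize(fake_llm, "long report"))

# Verify the prompt that the code under test actually sent.
fake_llm.assert_awaited_once_with("Summarize: long report")
```

This approach trades realism for simplicity: it checks the surrounding logic but does not exercise the SDK client or its request format.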

Pattern 6: Testing Agent Tools in Isolation

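# tests/test_tools.py
import pytest
# Assumption: validate_input raises pydantic's ValidationError; import your own
# exception type here if the tool raises something else.
from pydantic import ValidationError
from app.tools import search_database, format_response, validate_input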

@pytest.mark.asyncio
async def test_search_tool():
    """Test database search tool function."""
    # This tests the tool logic, NOT the LLM
    results = await search_database(query="python")

    assert isinstance(results, list)
    assert all("python" in r["title"].lower() for r in results)

def test_format_response():
    """Test response formatting utility."""
    raw = {"items": [1, 2, 3], "count": 3}
    formatted = format_response(raw)

    assert "3 items found" in formatted

def test_validate_input_rejects_injection():
    """Test input validation blocks SQL injection."""
    malicious = "'; DROP TABLE users; --"

    with pytest.raises(ValidationError):
        validate_input(malicious)

Pattern 7: Integration Tests with Mocked LLM

# tests/integration/test_agent_pipeline.py
import pytest
import respx
import httpx
from httpx import AsyncClient

@pytest.mark.asyncio
@respx.mock
async def test_complete_agent_flow(client: AsyncClient):
    """Test full agent pipeline with mocked LLM."""
    # Mock LLM to return a tool call
    respx.post("https://api.openai.com/v1/chat/completions").mock(
        side_effect=[
            # First call: LLM decides to use tool
            httpx.Response(200, json={
                "choices": [{
                    "message": {
                        "role": "assistant",
                        "tool_calls": [{
                            "id": "call_1",
                            "function": {
                                "name": "create_task",
                                "arguments": '{"title": "New Task"}'
                            }
                        }]
                    }
                }]
            }),
            # Second call: LLM responds with result
            httpx.Response(200, json={
                "choices": [{
                    "message": {
                        "role": "assistant",
                        "content": "I created the task 'New Task' for you."
                    }
                }]
            })
        ]
    )

    # Call agent endpoint
    response = await client.post(
        "/api/agent/chat",
        json={"message": "Create a task called 'New Task'"}
    )

    assert response.status_code == 200
    data = response.json()
    assert "created" in data["response"].lower()

    # Verify task was actually created in DB
    tasks = await client.get("/api/tasks")
    assert any(t["title"] == "New Task" for t in tasks.json())
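The two mocked responses above imply a loop inside the agent: call the LLM, run any requested tool, then call the LLM again with the tool result. The sketch below shows that control flow in isolation. `run_agent` and `create_task` are hypothetical stand-ins for the app's real pipeline, and `AsyncMock(side_effect=[...])` plays the role of the sequenced respx responses.

```python
import asyncio
import json
from unittest.mock import AsyncMock


async def run_agent(llm, tools: dict, user_message: str) -> str:
    """Hypothetical agent loop: call the LLM until it answers without tool calls."""
    messages = [{"role": "user", "content": user_message}]
    while True:
        reply = await llm(messages)
        messages.append(reply)
        tool_calls = reply.get("tool_calls")
        if not tool_calls:
            return reply["content"]
        for call in tool_calls:
            tool = tools[call["function"]["name"]]
            result = tool(**json.loads(call["function"]["arguments"]))
            messages.append(
                {"role": "tool", "tool_call_id": call["id"], "content": json.dumps(result)}
            )


created = []


def create_task(title: str) -> dict:
    """Stand-in tool that records the task instead of writing to a database."""
    created.append(title)
    return {"id": len(created), "title": title}


# First await returns a tool call, second await returns the final answer.
llm = AsyncMock(side_effect=[
    {"role": "assistant", "tool_calls": [{
        "id": "call_1",
        "function": {"name": "create_task", "arguments": '{"title": "New Task"}'},
    }]},
    {"role": "assistant", "content": "I created the task 'New Task' for you."},
])

answer = asyncio.run(
    run_agent(llm, {"create_task": create_task}, "Create a task called 'New Task'")
)
```

Asserting on both the final answer and the recorded side effect mirrors the integration test, which checks the chat response and then confirms the task exists.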

Pattern 8: Testing Error Handling

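# tests/test_error_handling.py
import pytest
import respx
import httpx
from httpx import AsyncClient
# Assumption: app.agent exposes these names; adjust the import to your project.
from app.agent import AgentResponseError, AgentTimeoutError, agent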

@pytest.mark.asyncio
@respx.mock
async def test_llm_timeout_handling():
    """Test graceful handling of LLM timeout."""
    respx.post("https://api.openai.com/v1/chat/completions").mock(
        side_effect=httpx.TimeoutException("Connection timed out")
    )

    with pytest.raises(AgentTimeoutError) as exc_info:
        await agent.process("Test query")

    assert "LLM request timed out" in str(exc_info.value)

@pytest.mark.asyncio
@respx.mock
async def test_malformed_response_handling():
    """Test handling of malformed LLM response."""
    respx.post("https://api.openai.com/v1/chat/completions").mock(
        return_value=httpx.Response(200, json={"invalid": "response"})
    )

    with pytest.raises(AgentResponseError):
        await agent.process("Test query")

@pytest.mark.asyncio
async def test_database_error_handling(client: AsyncClient):
    """Test API handles database errors gracefully."""
    # Force a constraint violation
    await client.post("/api/tasks", json={"title": "Task 1"})
    response = await client.post("/api/tasks", json={"title": "Task 1"})  # Duplicate

    assert response.status_code == 400
    assert "already exists" in response.json()["error"]
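The timeout test expects a low-level timeout to surface as an application error with a stable message. The sketch below shows one way that translation could be written. It uses `asyncio.wait_for` rather than an httpx timeout, and both `AgentTimeoutError` and `call_with_timeout` are hypothetical definitions standing in for the app's own.

```python
import asyncio


class AgentTimeoutError(Exception):
    """Hypothetical app-level error raised when the LLM call times out."""


async def call_with_timeout(llm_call, prompt: str, timeout_s: float) -> str:
    """Await the LLM call, converting a timeout into an application error."""
    try:
        return await asyncio.wait_for(llm_call(prompt), timeout=timeout_s)
    except asyncio.TimeoutError as exc:
        raise AgentTimeoutError("LLM request timed out") from exc


async def slow_llm(prompt: str) -> str:
    """Stands in for a request that hangs longer than the allowed timeout."""
    await asyncio.sleep(1)
    return "never returned"


try:
    asyncio.run(call_with_timeout(slow_llm, "Test query", timeout_s=0.01))
    error_message = None
except AgentTimeoutError as exc:
    error_message = str(exc)
```

Keeping the message fixed lets tests assert on it without depending on the wording of the underlying library exception.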

Test Organization

tests/
├── conftest.py              # Shared fixtures
├── unit/
│   ├── test_models.py       # SQLModel tests
│   ├── test_tools.py        # Agent tool tests
│   └── test_utils.py        # Utility function tests
├── integration/
│   ├── test_api.py          # FastAPI endpoint tests
│   └── test_agent.py        # Agent pipeline tests (mocked LLM)
└── e2e/
    └── test_flows.py        # End-to-end flows (still mocked LLM)

Fixtures Reference

| Fixture | Scope | Purpose |
|---|---|---|
| event_loop | session | Shared async event loop |
| setup_database | function | Fresh DB per test |
| session | function | Direct DB access |
| client | function | Async HTTP client |
| mock_user | function | Test authentication |

Best Practices

DO

  • Mock LLM calls at HTTP level (respx, httpx.MockTransport)
  • Use in-memory SQLite for fast DB tests
  • Test tool logic separately from LLM orchestration
  • Override FastAPI dependencies for auth/DB
  • Use factories for test data creation

DON'T

  • Make real LLM API calls in unit tests
  • Share state between tests
  • Test LLM reasoning quality (that's evals)
  • Skip error path testing
  • Use production databases
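Two items from the lists above, using factories for test data and not sharing state between tests, can be combined in a small payload factory. `make_task_payload` is a hypothetical helper, and the default `"medium"` priority is an assumed value rather than one taken from the app.

```python
import itertools

# Module-level counter so every payload gets a distinct title.
_task_counter = itertools.count(1)


def make_task_payload(**overrides) -> dict:
    """Build a unique task payload; tests override only the fields they care about."""
    n = next(_task_counter)
    payload = {"title": f"Task {n}", "priority": "medium"}  # assumed defaults
    payload.update(overrides)
    return payload


default_payload = make_task_payload()
urgent_payload = make_task_payload(priority="high")
```

A test can then call `client.post("/api/tasks", json=make_task_payload(priority="high"))` and stay readable when the model gains new required fields.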

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=app --cov-report=html

# Run specific test file
pytest tests/test_tasks.py

# Run tests matching pattern
pytest -k "test_create"

# Run async tests only
pytest -m asyncio

# Verbose output
pytest -v

CI/CD Integration

# .github/workflows/test.yml
name: Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v5
      - run: uv sync --all-extras
      - run: uv run pytest --cov --cov-report=xml
      - uses: codecov/codecov-action@v4
