name: analyze-eval
description: Investigate a single failing eval from the convex-evals system. Use when the user shares a visualizer URL pointing to a specific eval, asks about a specific failing eval, or references a specific eval ID.

Analyze Eval

User shares a URL like https://convex-evals.netlify.app/experiment/.../run/$runId/$category/$evalId
User asks "why did this eval fail?" or "what went wrong with this eval?"
User references a specific eval ID

The visualizer URL pattern is:

/experiment/$experimentId/run/$runId/$category/$evalId?tab=steps

$runId — the Convex document ID for the run (e.g. jn7922j1w29pdxm76bj9ps0enx80mg9e)
$evalId — the Convex document ID for the specific eval (e.g. jh73jvjz2n00gfeve1dt5h963s80mbc6)

You need only the evalId, the last path segment, to run the query.
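As a minimal sketch of pulling the evalId out of a shared link, the helper below splits the path according to the pattern above. The function name `parseVisualizerUrl`, the `VisualizerIds` type, and the experiment ID and category in the example URL are hypothetical and not part of convex-evals.

```typescript
// Hypothetical helper (not part of convex-evals): extracts the IDs from a
// visualizer URL of the form
//   /experiment/$experimentId/run/$runId/$category/$evalId
interface VisualizerIds {
  experimentId: string;
  runId: string;
  category: string;
  evalId: string;
}

function parseVisualizerUrl(url: string): VisualizerIds | null {
  let pathname: string;
  try {
    // The query string (e.g. ?tab=steps) is not part of pathname.
    pathname = new URL(url).pathname;
  } catch {
    return null; // not a parseable absolute URL
  }
  const segments = pathname.split("/").filter(Boolean);
  // Expected: ["experiment", experimentId, "run", runId, category, evalId]
  if (segments.length !== 6 || segments[0] !== "experiment" || segments[2] !== "run") {
    return null;
  }
  const [, experimentId, , runId, category, evalId] = segments;
  return { experimentId, runId, category, evalId };
}

// Example: "exp1" and "some-category" are made-up placeholder values.
const exampleIds = parseVisualizerUrl(
  "https://convex-evals.netlify.app/experiment/exp1/run/jn7922j1w29pdxm76bj9ps0enx80mg9e/some-category/jh73jvjz2n00gfeve1dt5h963s80mbc6?tab=steps",
);
```

Returning `null` for anything that does not match the pattern makes it easy to fall back to asking the user for the eval ID directly.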

Run the internal action from the evalScores/ directory. Always use --prod to query the production database (where CI writes results):

npx convex run --prod debug:getEvalDebugInfo '{"evalId": "<evalId>"}'

This returns a JSON object with:

Field	Contents
`eval`	Name, category, evalPath, status (pass/fail + failure reason), task text
`run`	Model name, provider, experiment name, run status
`steps`	Array of step results: filesystem, install, deploy, tsc, eslint, tests — each with pass/fail/skipped and failure reason
`outputFiles`	Map of file path -> file content from the model's generated output (unzipped)
`evalSourceFiles`	Map of file path -> file content from the eval source (answer dir, grader, TASK.txt, etc.)

With the data returned, compare:

Which step failed? — Check steps for the first entry with status.kind === "failed". The failureReason field has the error message.
What did the model generate? — Look at outputFiles for the model's code.
What was expected? — Look at evalSourceFiles for the answer directory and grader test files.
What was the task? — Check eval.task for the TASK.txt content.
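The first check in the list above can be sketched as a small function. The types here are assumptions made for illustration: only `status.kind === "failed"` and the `failureReason` field come from the description above. The `name` field, the `"passed"` and `"skipped"` kind values, and the placement of `failureReason` on `status` are guesses about the response shape, so adjust them to the actual JSON.

```typescript
// Assumed shapes, inferred from the field descriptions above.
// Only kind === "failed" and failureReason are taken from this document.
type StepStatus =
  | { kind: "passed" } // assumed value
  | { kind: "skipped" } // assumed value
  | { kind: "failed"; failureReason: string };

interface StepResult {
  name: string; // assumed field, e.g. "tsc", "eslint", "tests"
  status: StepStatus;
}

// Returns the first failed step in array order, or null if none failed.
function firstFailedStep(
  steps: StepResult[],
): { name: string; failureReason: string } | null {
  for (const step of steps) {
    if (step.status.kind === "failed") {
      return { name: step.name, failureReason: step.status.failureReason };
    }
  }
  return null;
}
```

Starting from the first failed step is usually the most informative, since later steps may be skipped or fail as a consequence of it.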

Common failure patterns:

eslint fail — Check the failure reason for the specific lint rule violated. Compare the model output against the answer to spot the lint issue.
tsc fail — TypeScript compilation error. Check the failure reason for the specific type error.
convex dev fail — Schema or function definition issues that prevent Convex from deploying.
tests fail — The grader tests didn't pass. Compare outputFiles against evalSourceFiles (look for files like grader.test.ts or answer/) to understand what the tests expected.

Classify the failure as one of:

MODEL_FAULT: The model genuinely got it wrong
OVERLY_STRICT: The eval/lint/test requirements are unreasonable for what was asked
AMBIGUOUS_TASK: The task description is unclear and the model's interpretation was reasonable
KNOWN_GAP: A known limitation of this eval that affects all models (e.g. the Convex API returns fields the model can't predict without being told)
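If the classification is recorded programmatically, a string-literal union keeps the label restricted to the four values above. This is an illustrative sketch only; the names `CLASSIFICATIONS`, `Classification`, and `isClassification` are hypothetical and do not come from convex-evals.

```typescript
// The four labels listed above, as a readonly tuple.
const CLASSIFICATIONS = [
  "MODEL_FAULT",
  "OVERLY_STRICT",
  "AMBIGUOUS_TASK",
  "KNOWN_GAP",
] as const;

// Union type: "MODEL_FAULT" | "OVERLY_STRICT" | "AMBIGUOUS_TASK" | "KNOWN_GAP"
type Classification = (typeof CLASSIFICATIONS)[number];

// Type guard for validating a free-form string before using it as a label.
function isClassification(value: string): value is Classification {
  return (CLASSIFICATIONS as readonly string[]).includes(value);
}
```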

Summarize:
