vibecheck — a YAML-first eval framework for any LLM
An agent-evaluation framework built around a simple YAML DSL. Compare models, save suites, mix string matching with semantic and LLM-judge checks, and run multi-model evals from the command line. The CLI is open source; the hosted service is in invite-only preview at vibescheck.io.
the problem
Evals are how you stop debugging-by-vibe and start shipping with confidence — but the tooling forces a choice between heavyweight platforms (good, but a commitment) and ad-hoc scripts (fast, but disposable). For a solo builder or a small team that just needs to ask “did this prompt change make things better or worse, on which models?”, the gap is real.
what shipped
vibecheck — a CLI with a YAML DSL designed for the tightest possible iteration loop.
metadata:
  name: hello-world
  model: anthropic/claude-3.5-sonnet
evals:
  - prompt: Say hello
    checks:
      - match: "*hello*"
      - min_tokens: 1
      - max_tokens: 50
vibe check -f hello-world.yaml
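To make the semantics of those three checks concrete, here is a minimal Python sketch of how a glob match and token bounds could be evaluated against a model response. It is not vibecheck's implementation: the function names (count_tokens, run_check) are hypothetical, case-insensitive matching is an assumption, and whitespace-separated words stand in for real model tokens.

```python
from fnmatch import fnmatchcase


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate model tokens.
    return len(text.split())


def run_check(check: dict, response: str) -> bool:
    """Evaluate a single-key check mapping such as {"match": "*hello*"}."""
    (kind, arg), = check.items()  # each check is expected to have exactly one key
    if kind == "match":
        # Assumption: glob matching ignores case.
        return fnmatchcase(response.lower(), str(arg).lower())
    if kind == "min_tokens":
        return count_tokens(response) >= int(arg)
    if kind == "max_tokens":
        return count_tokens(response) <= int(arg)
    raise ValueError(f"unknown check kind: {kind}")


# The same checks as the hello-world suite, applied to a made-up response.
checks = [{"match": "*hello*"}, {"min_tokens": 1}, {"max_tokens": 50}]
response = "Hello! How can I help you today?"
passed = all(run_check(c, response) for c in checks)
print(passed)
```

Each check is a one-key mapping, which is why the YAML stays short: adding a check means adding one line.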
Things that fell out of that DSL:
- Multiple check kinds in the same suite. Glob match, semantic similarity, llm_judge quality assessments, token bounds. Cheap checks short-circuit before expensive ones.
- Multi-model comparison. -m "openai*,anthropic*" runs the same suite across providers. Results sort by price-performance.
- Suites, variables, and secrets as first-class CLI objects. vibe set, vibe get, vibe var set, vibe secret set. Re-runs are trivial; secrets stay write-only.
- Claude Code skill + MCP-tool eval support. Run evals from inside an agent loop, or eval an MCP tool the same way you'd eval a prompt.
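Two of the ideas in that list, cheap checks short-circuiting before expensive ones and glob-based model selection, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's actual code: the COST table, run_short_circuit, and select_models are hypothetical names, the model IDs are only examples, and price-performance sorting is not modeled here.

```python
from fnmatch import fnmatchcase
from typing import Callable

# Hypothetical relative cost per check kind; lower-cost checks run first.
COST = {"match": 0, "min_tokens": 0, "max_tokens": 0, "semantic": 1, "llm_judge": 2}

Check = tuple[str, Callable[[str], bool]]


def run_short_circuit(checks: list[Check], response: str) -> tuple[bool, list[str]]:
    """Run checks cheapest-first and stop at the first failure."""
    executed: list[str] = []
    for kind, fn in sorted(checks, key=lambda c: COST[c[0]]):
        executed.append(kind)
        if not fn(response):
            return False, executed  # expensive checks after this point never run
    return True, executed


def select_models(patterns: str, available: list[str]) -> list[str]:
    """Expand a comma-separated glob list such as "openai*,anthropic*"."""
    globs = [p.strip() for p in patterns.split(",")]
    return [m for m in available if any(fnmatchcase(m, g) for g in globs)]


checks: list[Check] = [
    ("llm_judge", lambda r: True),  # stand-in for an expensive judge call
    ("match", lambda r: fnmatchcase(r.lower(), "*hello*")),
]
ok, executed = run_short_circuit(checks, "Goodbye")
print(ok, executed)

models = select_models(
    "openai*,anthropic*",
    ["openai/gpt-4o", "anthropic/claude-3.5-sonnet", "google/gemini-pro"],
)
print(models)
```

In the failing case above, the glob check fails first, so the judge stand-in is never called. That ordering is what keeps a broken prompt from costing a judge call per eval.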
what changed
The eval loop dropped from “set up a project” to “write five lines of YAML.” For my own work I now run a vibe check before merging anything that touches a prompt. The same suite I write for myself doubles as the regression check an agent can run autonomously.
the lesson
A good DSL beats a good UI when the user is sometimes a human, sometimes an agent. YAML is boring on purpose — it’s the most stable thing both audiences already know how to read and write.
Source: github.com/hev/vibecheck · API keys at vibescheck.io.