
vibecheck — a YAML-first eval framework for any LLM

An agent-evaluation framework built around a simple YAML DSL. Compare models, save suites, mix string matching with semantic and LLM-judge checks, and run multi-model evals from the command line. The CLI is open source; the hosted service is in invite-only preview at vibescheck.io.

role
Solo build — design, engineering, release
date
October 6, 2025
stack
TypeScript · npm CLI · YAML DSL · Multi-provider LLM (OpenRouter) · Semantic + LLM-judge checks · Claude Code skill / MCP testing

the problem

Evals are how you stop debugging-by-vibe and start shipping with confidence — but the tooling forces a choice between heavyweight platforms (good, but a commitment) and ad-hoc scripts (fast, but disposable). For a solo builder or a small team that just needs to ask “did this prompt change make things better or worse, on which models?”, the gap is real.

what shipped

vibecheck — a CLI with a YAML DSL designed for the tightest possible iteration loop.

metadata:
  name: hello-world
  model: anthropic/claude-3.5-sonnet

evals:
  - prompt: Say hello
    checks:
      - match: "*hello*"
      - min_tokens: 1
      - max_tokens: 50

Run the suite from the command line:

vibe check -f hello-world.yaml
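
To make the three checks in the suite above concrete, here is a minimal TypeScript sketch of how checks of this shape could be evaluated against a model response. It is not vibecheck's actual implementation. The `Check` type and the `globToRegExp`, `countTokens`, `runCheck`, and `runChecks` helpers are hypothetical names. Case-insensitive glob matching, whitespace-based token counting, and "all checks must pass" are assumptions made for illustration.

```typescript
// Hypothetical check shapes mirroring the three keys used in the YAML above.
type Check =
  | { match: string }
  | { min_tokens: number }
  | { max_tokens: number };

// Convert a glob such as "*hello*" into an anchored RegExp.
// Assumption: only "*" is a wildcard and matching is case-insensitive.
function globToRegExp(glob: string): RegExp {
  const escaped = glob
    .split("*")
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join("[\\s\\S]*");
  return new RegExp(`^${escaped}$`, "i");
}

// Assumption: tokens are approximated as whitespace-separated words.
// A real tokenizer would give different counts.
function countTokens(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Evaluate a single check against a response string.
function runCheck(check: Check, response: string): boolean {
  if ("match" in check) return globToRegExp(check.match).test(response);
  if ("min_tokens" in check) return countTokens(response) >= check.min_tokens;
  return countTokens(response) <= check.max_tokens;
}

// Assumption: an eval passes only if every one of its checks passes.
function runChecks(checks: Check[], response: string): boolean {
  return checks.every((check) => runCheck(check, response));
}

const checks: Check[] = [
  { match: "*hello*" },
  { min_tokens: 1 },
  { max_tokens: 50 },
];

console.log(runChecks(checks, "Hello there!")); // true
console.log(runChecks(checks, "Goodbye."));     // false
```

Modelling each check as a small pure function of the response keeps the YAML keys and the evaluation logic in one-to-one correspondence, which is one plausible reason a DSL like this stays easy to extend.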

Things that fell out of that DSL:

what changed

The eval loop dropped from “set up a project” to “write five lines of YAML.” For my own work I now run a vibe check before merging anything that touches a prompt. The same suite I write for myself doubles as the regression check an agent can run autonomously.

the lesson

A good DSL beats a good UI when the user is sometimes a human, sometimes an agent. YAML is boring on purpose — it’s the most stable thing both audiences already know how to read and write.

Source: github.com/hev/vibecheck · API keys at vibescheck.io.