Verification

Manual review doesn't scale. If you're reading every AI output to check correctness, you're not using AI—you're using a very expensive draft generator.

Verification must be automated, layered, and probabilistic. You can't guarantee correctness, but you can bound error rates and detect drift.

The key is designing verification strategies that are cheaper than the task itself while maintaining acceptable false-negative rates.

This note catalogs practical verification patterns for production AI systems.

What I Observed

Structural validation catches roughly 80% of errors before you even look at semantics. If you expect JSON and get malformed syntax, that's an instant fail. If you expect five bullet points and get three paragraphs, reject it. Most AI errors manifest as format violations. Schema validation is fast, deterministic, and composable.
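
A minimal sketch of this first layer, assuming a JSON output with a known set of required fields (a real pipeline might use a full JSON Schema validator instead); the function and field names are illustrative:

    import json

    def validate_structure(raw_output: str, required_keys: set[str]) -> bool:
        """Layer one: reject malformed or structurally incomplete JSON before any semantic check."""
        try:
            data = json.loads(raw_output)  # syntax check: the output must parse at all
        except json.JSONDecodeError:
            return False
        if not isinstance(data, dict):
            return False
        return required_keys.issubset(data.keys())  # shape check: expected fields are present

    # Example: expect a summary object with these top-level fields.
    print(validate_structure('{"summary": "...", "keywords": ["a"]}', {"summary", "keywords"}))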

Cross-model verification works surprisingly well. Generate the same output with two different models (GPT-4 and Claude, or two temperature settings). If they disagree significantly, flag for review. Agreement doesn't guarantee correctness, but persistent disagreement reliably indicates ambiguity. The cost is 2x generation, but manual review is 10x slower.
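
A sketch of the disagreement check, assuming the two outputs have already been generated by different models or temperature settings; the lexical similarity measure and the 0.8 threshold are illustrative choices, not recommendations:

    from difflib import SequenceMatcher

    def agreement(a: str, b: str) -> float:
        """Crude lexical agreement between two generations, from 0.0 to 1.0."""
        return SequenceMatcher(None, a, b).ratio()

    def needs_review(out_a: str, out_b: str, threshold: float = 0.8) -> bool:
        """Flag for manual review (or regeneration) when two generations diverge too much."""
        return agreement(out_a, out_b) < threshold

    # out_a and out_b would come from two different models, or one model at two temperatures.
    print(needs_review("Revenue grew 12% in Q3.", "Q3 revenue rose roughly 12%."))

Swapping the lexical diff for an embedding similarity would catch paraphrases that merely reword the same answer; the flagging logic stays the same.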

Self-consistency checks exploit model uncertainty. Ask the model to generate an answer, then critique it. If the critique identifies flaws, regenerate. Extract claims and query the model about contradictory evidence to detect confident wrong answers. Uncertainty leaks through even when the output looks confident.
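
A sketch of the critique-then-regenerate loop; complete() is a placeholder for whatever model client you use, and the "reply OK" convention and retry count are assumptions:

    def complete(prompt: str) -> str:
        """Stand-in for your model client; replace with a real API call."""
        raise NotImplementedError

    def generate_with_self_check(task_prompt: str, max_attempts: int = 3) -> str:
        """Generate, have the model critique its own answer, and regenerate if it finds flaws."""
        answer, constraints = "", ""
        for _ in range(max_attempts):
            answer = complete(task_prompt + constraints)
            critique = complete(
                "List factual or logical flaws in this answer, or reply OK:\n" + answer
            )
            if critique.strip().upper().startswith("OK"):
                return answer
            # Feed the critique back in as explicit constraints for the retry.
            constraints = "\nAvoid these problems: " + critique
        return answer  # best effort after max_attempts; the caller may escalate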

Programmatic assertions work like unit tests for AI. If the task is "summarize this document," verify the character count is under the limit, verify that key terms from the section headers appear in the summary, and verify no new entities are introduced. These checks don't prove the summary is good, but they prove it's not catastrophically wrong.
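
A sketch of invariant checks for the summarization example; the capitalized-word entity heuristic is deliberately naive and only illustrates the shape of these tests:

    import string

    def check_summary(summary: str, source: str, max_chars: int = 500) -> list[str]:
        """Necessary-condition checks: cheap tests that rule out obviously broken summaries."""
        def clean(word: str) -> str:
            return word.strip(string.punctuation)

        failures = []
        if len(summary) > max_chars:
            failures.append("over the character limit")
        if len(summary) >= len(source):
            failures.append("not shorter than the source")
        # Naive entity check: capitalized words in the summary should also appear in the source.
        source_words = {clean(w) for w in source.split()}
        new_entities = {clean(w) for w in summary.split()
                        if clean(w)[:1].isupper() and clean(w) not in source_words}
        if new_entities:
            failures.append(f"entities not found in source: {sorted(new_entities)}")
        return failures  # an empty list means every invariant held

    print(check_summary("Acme grew 12%.", "Acme Corp reported 12% growth this quarter."))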

Why It Happens

Structural errors emerge because the model is predicting tokens, not parsing schemas. It learns patterns of valid JSON from training data, but it's not running a JSON validator internally. When the model makes a mistake, it often cascades into syntax errors. Malformed outputs are artifacts of the prediction process breaking down. Catching structural errors is catching the model mid-failure, before semantic evaluation is even meaningful.

Cross-model disagreement surfaces prompt ambiguity. Different models have different training data, different architectures, different inductive biases. When they agree, it suggests the prompt constrains the output sufficiently that the answer is robust. When they disagree, it suggests the prompt is under-specified and the models are filling in gaps differently. Disagreement isn't validation—it's a canary for bad prompts.

Self-critique works because the model has "meta-reasoning" capacity. It can evaluate its own outputs using the same mechanisms it used to generate them. If you ask it to critique, you're running the generation process in reverse—predicting what a critic would say, not what the task would produce. The critic token distribution is different from the generator distribution. When they conflict, you've found an inconsistency in the model's internal representation.

Programmatic assertions exploit task structure. Every task has invariants: summaries are shorter than originals, translations preserve entities, extractions only reference source text. These invariants are cheap to verify computationally. You're testing necessary conditions, not sufficient ones. A passing test doesn't mean the output is perfect, but a failing test definitively means something broke. You're doing negative verification: ruling out failure, not proving success.

What I Do Now

I build validation into every output pipeline. Structured formats get schema validation. Text outputs get length checks, keyword presence checks, entity extraction cross-references. If any check fails, I log the failure, regenerate with adjusted constraints, and repeat. Invalid outputs never reach production.
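
A sketch of the validate-and-regenerate loop, assuming a generate callable that wraps the model client and a list of boolean check functions; the retry count and constraint wording are placeholders:

    import logging
    from typing import Callable

    def generate_validated(generate: Callable[[str], str],
                           validators: list[Callable[[str], bool]],
                           prompt: str,
                           max_retries: int = 3) -> str | None:
        """Run every check on each generation; log failures and retry with tightened constraints."""
        for attempt in range(max_retries):
            output = generate(prompt)
            failed = [check.__name__ for check in validators if not check(output)]
            if not failed:
                return output  # only outputs that pass every check leave the pipeline
            logging.warning("attempt %d failed checks: %s", attempt, failed)
            prompt += f"\nThe previous output failed these checks: {', '.join(failed)}. Fix them."
        return None  # retries exhausted; the caller escalates to manual review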

I use cross-model verification for high-stakes tasks. I generate with two models (or the same model at different temperatures), diff the outputs, and flag disagreements. If agreement is below a threshold, I escalate to manual review or add more constraints and regenerate.

I implement self-consistency loops for factual tasks. After generating an answer, I prompt the model to list potential errors. If it flags issues, I regenerate with those as explicit constraints. If it's confident, I extract key claims and run contradiction checks. Confident wrong answers often reveal contradictions under adversarial prompting.
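
A sketch of the contradiction check on extracted claims; complete() is again a placeholder for the model client, and the "reply NONE" convention is an assumed protocol:

    def complete(prompt: str) -> str:
        """Stand-in for your model client; replace with a real API call."""
        raise NotImplementedError

    def contradiction_check(answer: str) -> list[str]:
        """Extract the answer's key claims, then ask the model to argue against each one."""
        claims = complete("List the key factual claims in this answer, one per line:\n" + answer)
        suspect = []
        for claim in (line.strip() for line in claims.splitlines() if line.strip()):
            rebuttal = complete(
                "What evidence would contradict this claim? Reply NONE if it is solid:\n" + claim
            )
            if not rebuttal.strip().upper().startswith("NONE"):
                suspect.append(claim)
        return suspect  # any surviving entries trigger regeneration or escalation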

I write programmatic assertions for task invariants. Every task has a test suite: character limits, entity preservation, format adherence. Outputs run through the suite automatically. Pass rate becomes a metric I track over time. If pass rates drop, I investigate prompt drift or model updates.
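
A sketch of tracking pass rate over a rolling window; the window size and alert threshold are placeholders to tune per task:

    from collections import deque

    class PassRateMonitor:
        """Rolling pass rate of the assertion suite, used as a drift signal."""

        def __init__(self, window: int = 500, alert_below: float = 0.95):
            self.results: deque[bool] = deque(maxlen=window)
            self.alert_below = alert_below

        def record(self, passed: bool) -> None:
            self.results.append(passed)

        def pass_rate(self) -> float:
            return sum(self.results) / len(self.results) if self.results else 1.0

        def drifting(self) -> bool:
            # Only alert once the window is full; a sustained drop suggests
            # prompt drift or a silent model update.
            return len(self.results) == self.results.maxlen and self.pass_rate() < self.alert_below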

Practical Checklist

  • Implement schema validation (JSON/XML parsers) as the first verification layer—reject malformed outputs instantly
  • Use cross-model generation (two models or two temperatures) for high-stakes tasks and flag significant disagreements
  • Add self-consistency loops: prompt the model to critique its own output and regenerate if it identifies flaws
  • Write programmatic assertions for task invariants (length limits, entity preservation, format rules) and run them automatically
  • Track verification pass rates over time as a monitoring metric to detect prompt drift or model degradation

Glossary

Structural Validation
Verifying output format (syntax, schema adherence) independent of semantic correctness. Fast, deterministic, catches ~80% of errors.
Cross-Model Verification
Generating the same output with multiple models/settings and flagging disagreements as potential errors or prompt ambiguities.
Self-Consistency Check
Prompting the model to critique its own outputs or answer contradictory queries to detect uncertainty and hallucinations.
Task Invariants
Necessary conditions for valid outputs (e.g., summaries shorter than source). Cheap to verify computationally, rule out failure modes.
Negative Verification
Proving outputs aren't wrong rather than proving they're correct. Bounds error rates without requiring full semantic validation.