Determinism
You can't make a language model deterministic by setting temperature to zero. You can only narrow the output distribution. True determinism requires architectural constraints, not just parameter tweaks.
The goal isn't identical outputs every time—it's predictable variance. Control the degrees of freedom, and randomness becomes a feature, not a bug.
Determinism emerges from structure: strict formats, explicit validation, and compositional prompts that isolate non-deterministic steps.
This note documents how to build deterministic systems on top of probabilistic foundations.
What I Observed
Temperature 0 doesn't guarantee determinism. Run the same prompt twice and you'll usually get identical outputs, but not always. API-level factors (load balancing across model shards, floating-point precision in softmax, tokenization edge cases) introduce small variations. These variations are rare, but their probability is never zero.
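A quick way to see this empirically is to hash repeated completions of the same prompt. A minimal sketch, where `call_model` is a hypothetical stand-in for whatever client you actually use:

```python
import hashlib
from collections import Counter

def call_model(prompt: str, temperature: float = 0.0) -> str:
    """Hypothetical stand-in for your provider's completion call."""
    raise NotImplementedError("wire this to your LLM client")

def measure_repeatability(prompt: str, runs: int = 20) -> Counter:
    """Run one prompt many times at temperature 0 and count distinct
    outputs by hash; a perfectly deterministic system yields one bucket."""
    buckets: Counter = Counter()
    for _ in range(runs):
        output = call_model(prompt, temperature=0.0)
        buckets[hashlib.sha256(output.encode()).hexdigest()] += 1
    return buckets

# More than one distinct hash means temperature 0 alone
# did not buy you determinism for this prompt.
```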
Structured output formats radically increase consistency. Ask for JSON, and the model's output space collapses from "all possible text" to "syntactically valid JSON matching the schema." The same prompt generates functionally identical results across runs, even if token-level details differ. The constraint does the work, not the temperature setting.
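As an illustration, here is one way to embed the schema in the prompt and compare runs at the level of parsed structure rather than raw text. The schema and prompt wording are invented examples:

```python
import json

# Hypothetical schema for an extraction task; the schema text itself
# becomes part of the prompt, shrinking the valid output space.
SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "year": {"type": "integer"},
    },
    "required": ["name", "year"],
}

def build_prompt(question: str) -> str:
    return (
        f"{question}\n\n"
        "Respond with ONLY a JSON object matching this schema, "
        "no prose before or after:\n"
        f"{json.dumps(SCHEMA, indent=2)}"
    )

# Compare parsed objects, not raw strings: two runs that differ only
# in whitespace or key order still count as functionally identical.
def same_result(raw_a: str, raw_b: str) -> bool:
    return json.loads(raw_a) == json.loads(raw_b)
```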
Validation layers create deterministic checkpoints. If you generate structured output and parse it with a strict schema, invalid outputs fail immediately. You can retry with the same prompt, increasing temperature to explore alternatives, until you get a valid result. The final output is deterministic relative to the validation criteria, even if the generation process was stochastic.
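A sketch of that checkpoint pattern, using the `jsonschema` package for validation and the same hypothetical `call_model` helper; the escalation step size is arbitrary:

```python
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

def call_model(prompt: str, temperature: float) -> str:
    """Hypothetical stand-in for your provider's completion call."""
    raise NotImplementedError

def generate_validated(prompt: str, schema: dict, max_attempts: int = 5) -> dict:
    """Stochastic generation behind a deterministic checkpoint:
    only outputs that pass the schema ever escape this function."""
    temperature = 0.0
    for _ in range(max_attempts):
        raw = call_model(prompt, temperature=temperature)
        try:
            parsed = json.loads(raw)
            validate(instance=parsed, schema=schema)
            return parsed  # deterministic relative to the schema
        except (json.JSONDecodeError, ValidationError):
            temperature = min(1.0, temperature + 0.2)  # explore alternatives
    raise RuntimeError(f"no valid output after {max_attempts} attempts")
```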
Multi-step prompts isolate variance. Break a task into deterministic steps (extract entities) and creative steps (generate descriptions). Run the deterministic steps at temperature 0 with strict formats. Run the creative steps at higher temperature with looser constraints. Variance exists only where you explicitly allow it. The system as a whole becomes predictable because you control where randomness enters.
Why It Happens
Temperature 0 makes the model greedy—it always picks the highest-probability token. But "highest" is computed from floating-point operations on massive matrices. Tiny numerical differences (rounding errors, distributed computation variance) can flip which token ranks first. These differences are usually invisible, but they're non-zero. Temperature 0 reduces entropy; it doesn't eliminate it.
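The mechanism is easy to demonstrate in pure Python. The magnitudes below are exaggerated for clarity (real logit discrepancies are on the order of one ulp), and the token names are invented, but the non-associativity is exactly the same:

```python
# Floating-point addition is not associative, so the "same" logit can
# come out differently depending on the order a kernel happens to
# accumulate partial sums in.
terms = [1e16, 1.0, 1.0, -1e16]

left_to_right = ((terms[0] + terms[1]) + terms[2]) + terms[3]  # 0.0
regrouped = (terms[1] + terms[2]) + (terms[0] + terms[3])      # 2.0

rival_logit = 1.0  # a competing token whose logit is stable across runs

def greedy_pick(logit_a: float, logit_b: float) -> str:
    """Greedy decoding: always take the higher logit."""
    return "token_a" if logit_a > logit_b else "token_b"

print(greedy_pick(left_to_right, rival_logit))  # token_b
print(greedy_pick(regrouped, rival_logit))      # token_a
```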
Structured formats constrain the solution space geometrically. The model isn't generating "text"—it's navigating a probability distribution over token sequences. When you specify JSON, you're ruling out entire regions of that distribution. Only sequences matching the syntax tree are valid. The model's search space shrinks by orders of magnitude. Determinism emerges from constraint, not from the model's internals.
Validation acts as a projection operator. You're collapsing a noisy, high-dimensional output (raw text) onto a lower-dimensional manifold (valid structures). Invalid outputs get rejected and regenerated. The Monte Carlo process converges to the constrained space. From the outside, it looks deterministic—you only see the valid outputs, not the generation attempts.
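A toy simulation of the projection view: the generator below is deliberately noisy, yet every output an outside caller ever observes is valid. The names and the validity set are invented for illustration:

```python
import random

VALID = {"red", "green", "blue"}  # the constrained space

def noisy_generator() -> str:
    """Stand-in for a stochastic model: mostly valid, sometimes not."""
    return random.choice(["red", "green", "blue", "rde", "geren", ""])

def project() -> str:
    """Rejection sampling: resample until the output lands in VALID.
    Callers never observe the rejected attempts."""
    while True:
        candidate = noisy_generator()
        if candidate in VALID:
            return candidate

outputs = {project() for _ in range(1000)}
assert outputs <= VALID  # every observed output is valid, by construction
```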
Multi-step decomposition exploits conditional independence. If step B doesn't depend on randomness from step A, you can make A deterministic and allow B to vary. The overall system becomes modular. You control the dataflow between steps, and that control structure enforces determinism where it matters while preserving flexibility where it's useful.
What I Do Now
I use structured output formats (JSON, YAML, XML) for any task where consistency matters. I don't ask for "information about X"—I ask for a JSON object with specific keys. The schema is part of the prompt. The model generates to fit the constraint, and I parse with a strict validator. If parsing fails, I regenerate. The output is deterministic relative to the schema, not the raw text.
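One way to wire this up, assuming pydantic v2 (any schema validator works; the `Paper` model is a made-up example):

```python
from pydantic import BaseModel, ValidationError  # pip install pydantic

class Paper(BaseModel):
    """The schema is the contract: the same model definition drives
    both the prompt text and the parser, so they can't drift apart."""
    title: str
    year: int
    authors: list[str]

def parse_or_retry(raw: str) -> Paper | None:
    try:
        return Paper.model_validate_json(raw)  # wrong shape fails here
    except ValidationError:
        return None  # caller regenerates; invalid output never propagates
```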
I separate deterministic and creative steps explicitly. If I need to extract data and then summarize it, I run two prompts: one at temperature 0 with strict JSON output (extraction), one at temperature 0.7 with prose output (summary). The extraction is reproducible. The summary varies slightly, but it's anchored to deterministic extracted data. I control which layer has variance.
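A sketch of that two-layer split, again with a hypothetical `call_model` helper; the prompts and temperature values mirror the ones above:

```python
import json

def call_model(prompt: str, temperature: float) -> str:
    """Hypothetical stand-in for your provider's completion call."""
    raise NotImplementedError

def extract_facts(document: str) -> dict:
    """Deterministic layer: temperature 0, strict JSON output."""
    prompt = (
        "Extract the people, dates, and amounts from the text below. "
        'Respond with ONLY a JSON object with keys "people", "dates", '
        '"amounts".\n\n' + document
    )
    return json.loads(call_model(prompt, temperature=0.0))

def summarize(facts: dict) -> str:
    """Creative layer: higher temperature, prose output, anchored to
    the deterministic facts rather than the raw document."""
    prompt = (
        "Write a two-sentence summary based only on these facts:\n"
        + json.dumps(facts, indent=2)
    )
    return call_model(prompt, temperature=0.7)

# The summary wording varies run to run; the facts it is built on do not.
```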
I validate outputs programmatically, not manually. Every model response passes through a parser or schema validator. Invalid responses trigger automatic retries with adjusted prompts (add examples, tighten constraints, reduce token budget). The retry loop is part of the system design, not a fallback. Determinism is a property of the system, not the model.
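A variant of the earlier retry loop that adjusts the prompt instead of the temperature, feeding the parse error back to the model; the helper and wording are illustrative:

```python
import json

def call_model(prompt: str, temperature: float = 0.0) -> str:
    """Hypothetical stand-in for your provider's completion call."""
    raise NotImplementedError

def generate_with_feedback(prompt: str, max_attempts: int = 3) -> dict:
    """On failure, tighten the prompt: append the parse error so the
    next attempt can correct it."""
    current = prompt
    for _ in range(max_attempts):
        raw = call_model(current)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            current = (
                prompt
                + f"\n\nYour previous response was not valid JSON ({err})."
                + " Respond with valid JSON only, no surrounding prose."
            )
    raise RuntimeError("validation failed on every attempt")
```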
I version prompts and log outputs. When I change a prompt, I re-run the test suite (same inputs, compare outputs). If determinism breaks, I bisect the prompt changes to find what introduced variance. Treat prompts like code: version control, diff reviews, regression tests. Determinism is something you verify continuously, not assume.
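A minimal regression test in pytest style, assuming a hypothetical golden file that maps frozen prompts to the output hashes recorded when the prompt version was last approved:

```python
import hashlib
import json
from pathlib import Path

GOLDEN = Path("tests/golden_outputs.json")  # hypothetical golden file

def call_model(prompt: str, temperature: float = 0.0) -> str:
    """Hypothetical stand-in for your provider's completion call."""
    raise NotImplementedError

def test_prompt_determinism():
    """Re-run the frozen test inputs and compare output hashes against
    the recorded goldens; a mismatch signals determinism drift."""
    golden = json.loads(GOLDEN.read_text())
    for prompt, expected_hash in golden.items():
        output = call_model(prompt, temperature=0.0)
        actual = hashlib.sha256(output.encode()).hexdigest()
        assert actual == expected_hash, f"determinism drift on: {prompt[:60]}"
```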
Practical Checklist
- Use structured output formats (JSON/XML) with explicit schemas instead of free-form text for critical tasks
- Separate deterministic steps (data extraction) from creative steps (summarization) with explicit temperature boundaries
- Validate all model outputs with strict parsers—treat invalid outputs as retriable errors, not final results
- Version control your prompts and maintain regression test suites to detect determinism drift
- Build retry loops into the system design: invalid → adjust prompt → regenerate until valid
Glossary
- Greedy Decoding: Temperature 0 behavior where the model always selects the highest-probability token. Reduces variance but doesn't eliminate it, due to finite numerical precision.
- Structured Output: Constraining model responses to a specific format (JSON, XML, YAML) to shrink the output space and increase consistency.
- Validation Layer: A programmatic parser or schema that accepts or rejects model outputs on structural criteria, enabling deterministic retry loops.
- Conditional Independence: The property that one step's randomness doesn't affect another step, allowing selective determinism in multi-step systems.
- Prompt Versioning: Treating prompts as code (version control, diff reviews, regression tests) to maintain determinism over time.