Then • Now

Then: I asked for everything at once. Long prompts, multiple tasks, vague goals. I thought more tokens meant better results. The model thrashed, I got inconsistent outputs, and I blamed the model.

Now: I ask for one thing, precisely. Single-purpose prompts, explicit formats, clear success criteria. Compression improves fidelity.

This shift isn't about "prompt engineering tricks." It's about respecting the model's operational constraints and designing around them.

This note documents the phase transition from kitchen-sink prompting to surgical prompting.

What I Observed

Early prompts were maximalist. "Read this document and extract key entities and summarize the main points and identify sentiment and list any action items and format as JSON." I thought bundling tasks was efficient. Instead, outputs were inconsistent—sometimes I got entities but no summary, sometimes valid JSON with missing fields, sometimes prose instead of structure. The model wasn't failing; it was choosing randomly among conflicting instructions.

Breaking tasks apart changed everything. One prompt for entity extraction (temperature 0, JSON schema). One prompt for summarization (temperature 0.7, prose). One prompt for sentiment (three-class classification, logit bias). Each prompt succeeded independently. I composed the outputs downstream. The system became predictable because each component had a single, well-defined job.
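A minimal sketch of that decomposition, assuming a hypothetical call_model(prompt, temperature) wrapper around whichever client you use; the schemas, temperatures, and label set mirror the examples above, and the composition happens in plain code rather than in the prompt.

    import json

    def call_model(prompt: str, temperature: float) -> str:
        """Hypothetical wrapper around an LLM API call; returns raw text."""
        raise NotImplementedError("plug in your client here")

    def extract_entities(document: str) -> dict:
        # One job: entities only, deterministic, strict JSON.
        prompt = (
            "Extract named entities from the document below. "
            'Return JSON: {"people": [], "orgs": [], "places": []}.\n\n' + document
        )
        return json.loads(call_model(prompt, temperature=0.0))

    def summarize(document: str) -> str:
        # One job: a short prose summary, with some sampling freedom.
        prompt = "Summarize the document below in 3 bullet points.\n\n" + document
        return call_model(prompt, temperature=0.7)

    def classify_sentiment(document: str) -> str:
        # One job: a three-class label, deterministic.
        prompt = (
            "Classify the sentiment of the document below as exactly one of: "
            "positive, negative, neutral.\n\n" + document
        )
        return call_model(prompt, temperature=0.0).strip().lower()

    def analyze(document: str) -> dict:
        # Composition happens downstream, in code, not inside a single prompt.
        return {
            "entities": extract_entities(document),
            "summary": summarize(document),
            "sentiment": classify_sentiment(document),
        }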

Shorter prompts often produce better results. I used to think context was free—dump in examples, background info, tangential constraints. But long prompts dilute attention. The model spreads probability mass across too many possible interpretations. When I cut a 1000-token prompt to 200 tokens by removing fluff and focusing on the core task, output quality improved and latency dropped. Less wasn't just faster; it was clearer.

Precision beats verbosity. Instead of "summarize this in a friendly, engaging tone that captures the essence," I now write "summarize in 3 bullet points, <60 words each, present tense." Constraints are specific and measurable. The model doesn't guess what I want—it optimizes for stated criteria. Vague language invites variance. Precise language invites compliance.
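For concreteness, here is that contrast written out as prompt constants; the wording is adapted from this paragraph and is illustrative, not a canonical template.

    VAGUE_PROMPT = (
        "Summarize this in a friendly, engaging tone that captures the essence."
    )

    PRECISE_PROMPT = (
        "Summarize the text below in exactly 3 bullet points. "
        "Each bullet: present tense, under 60 words. "
        "Output only the bullets, one per line, starting with '- '."
    )

The second version is not only shorter to reason about; every constraint in it can be checked mechanically by a downstream validator.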

Why It Happens

Multi-task prompts create competing objectives. The model predicts tokens conditioned on every instruction at once. "Extract entities" pulls the output toward structured lists. "Summarize" pulls toward prose. "Format as JSON" pulls toward strict, brace-delimited syntax. These objectives aren't naturally aligned. The model samples from a compromised distribution, a local optimum that partially satisfies all constraints but fully satisfies none.

Long prompts diffuse attention. Transformer attention is a weighted sum over all input tokens. Each token in the prompt competes for attention weight. When you have 1000 tokens of context, the model allocates less weight to any single instruction. Critical constraints get drowned in noise. Shorter prompts concentrate attention on what matters. The model's limited capacity focuses on the task, not the preamble.
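A toy illustration of that normalization pressure, under the unrealistic assumption of uniform attention scores: with identical scores every token gets weight 1/n, so any single instruction token's share shrinks as the prompt grows. Real attention is learned and far from uniform; this only shows the direction of the effect.

    import math

    def softmax(scores):
        exps = [math.exp(s) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    for n in (50, 200, 1000):
        weights = softmax([1.0] * n)    # identical scores for simplicity
        print(n, round(weights[0], 4))  # 0.02, 0.005, 0.001: per-token weight shrinks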

Vague language maps to high-entropy distributions. "Friendly tone" doesn't correspond to a single token pattern—it's a vast, underspecified space. The model samples from that space, and you get variance. "Present tense, <60 words" is low-entropy—it rules out most tokens. The model's search space collapses. Precision narrows the distribution; vagueness widens it. You observe variance as a direct consequence of under-specification.
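A back-of-the-envelope way to see the entropy gap, under the simplifying assumption of uniform distributions: a three-label spec versus a "friendly tone" space with (say) tens of thousands of acceptable phrasings. The 50,000 figure is an assumption for illustration only.

    import math

    def uniform_entropy_bits(num_outcomes: int) -> float:
        # Shannon entropy of a uniform distribution over num_outcomes outcomes.
        return math.log2(num_outcomes)

    print(uniform_entropy_bits(3))       # ~1.6 bits: "positive/negative/neutral"
    print(uniform_entropy_bits(50_000))  # ~15.6 bits: "friendly tone" (assumed count)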

Compositional design exploits modularity. If task A and task B are independent, running them separately lets each use full model capacity; bundling them forces the model to split capacity between tasks. Separation costs little: total latency is dominated by model forward passes rather than round trips, and independent prompts can run concurrently (see the sketch below). You trade a small latency overhead for a large gain in predictability and reliability.
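A sketch of running independent single-purpose prompts concurrently, assuming a hypothetical async wrapper acall_model; because the calls don't depend on each other, wall-clock latency is roughly the slowest single call rather than the sum.

    import asyncio

    async def acall_model(prompt: str, temperature: float) -> str:
        """Hypothetical async wrapper around an LLM API call."""
        raise NotImplementedError("plug in your client here")

    async def analyze(document: str) -> dict:
        # Independent tasks fan out in parallel; composition stays in code.
        entities, summary, sentiment = await asyncio.gather(
            acall_model("Extract entities as JSON from:\n" + document, 0.0),
            acall_model("Summarize in 3 bullet points:\n" + document, 0.7),
            acall_model("Classify sentiment (positive/negative/neutral):\n" + document, 0.0),
        )
        return {"entities": entities, "summary": summary, "sentiment": sentiment}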

What I Do Now

I decompose every task into single-purpose prompts. If I catch myself using "and" in an instruction, I split it. "Extract X and summarize Y" becomes two prompts. "Generate Z and format as W" becomes generate + format. Each prompt has one job. I compose the outputs in code, not in the prompt. The model does primitives; I do orchestration.

I aggressively minimize token count. Every sentence in the prompt must justify its existence. Examples stay only if they're disambiguating edge cases. Background context stays only if it changes the output. I cut until the prompt is skeletal, then I test. If quality holds, the cut stays. If it degrades, I add back the minimum needed. The default is brevity; verbosity requires proof.

I replace vague language with measurable constraints. Instead of subjective adjectives ("clear," "engaging," "professional"), I use structural specs ("5 bullets," "present tense," "no jargon"). Instead of "summarize the key points," I write "extract the 3 main claims, <80 words total." If I can't measure it, I can't automate verification, and if I can't verify, I can't trust the output at scale.
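A minimal validator for a spec like "3 bullets, <80 words total"; the exact rules here are illustrative, the point is that structural constraints can be checked mechanically before an output flows downstream.

    def verify_summary(output: str, bullets_expected: int = 3, max_words: int = 80) -> bool:
        lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
        bullets = [ln for ln in lines if ln.startswith(("-", "*", "•"))]
        if len(bullets) != bullets_expected or len(bullets) != len(lines):
            return False  # wrong bullet count, or stray prose outside the bullets
        total_words = sum(len(b.lstrip("-*• ").split()) for b in bullets)
        return total_words < max_words

Used as a gate: if the check fails, retry or flag the output instead of trusting it.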

I version prompts like code. Every change gets tested against a regression suite (fixed inputs, compare outputs). I track metrics: output variance (BLEU across runs), validation pass rate, latency. When I simplify a prompt, I measure impact. This isn't "find the magic words"—it's systematic optimization. Prompts are infrastructure, not art.
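A sketch of such a regression harness, assuming a hypothetical call_model and fixtures of the form {"input": ..., "check": callable}; BLEU is swapped for difflib's similarity ratio here purely to keep the example dependency-free.

    import statistics
    import time
    from difflib import SequenceMatcher
    from itertools import combinations

    def call_model(prompt: str) -> str:
        """Hypothetical LLM call; plug in your client."""
        raise NotImplementedError

    def run_regression(prompt_template: str, fixtures: list, runs: int = 3) -> dict:
        pass_count, variances, latencies = 0, [], []
        for case in fixtures:  # each case: {"input": {...}, "check": callable}
            outputs = []
            for _ in range(runs):
                start = time.perf_counter()
                out = call_model(prompt_template.format(**case["input"]))
                latencies.append(time.perf_counter() - start)
                outputs.append(out)
            # Run-to-run dissimilarity: 0.0 means fully deterministic output.
            sims = [SequenceMatcher(None, a, b).ratio() for a, b in combinations(outputs, 2)]
            variances.append(1 - statistics.mean(sims))
            pass_count += all(case["check"](o) for o in outputs)
        return {
            "pass_rate": pass_count / len(fixtures),
            "mean_variance": statistics.mean(variances),
            "p50_latency_s": statistics.median(latencies),
        }

Every prompt change runs through the same fixtures; a simplification ships only if pass rate and variance hold.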

Practical Checklist

  • Decompose multi-task prompts: if you use "and" in an instruction, split it into separate prompts and compose outputs programmatically
  • Minimize token count aggressively—cut every sentence that doesn't change the output, test to verify quality holds
  • Replace vague language ("friendly," "clear") with measurable constraints ("5 bullets," "present tense," "<60 words")
  • Version prompts like code: regression tests on fixed inputs, track variance/pass-rate/latency metrics for every change
  • Default to brevity: verbosity must justify itself through A/B tests showing quality improvement

Glossary

Multi-Task Prompt
Single prompt requesting multiple operations (e.g., "extract and summarize"). Creates competing objectives, reduces consistency.
Attention Diffusion
Phenomenon where long prompts spread attention weight thinly across tokens, reducing focus on critical instructions.
High-Entropy Language
Vague instructions like "friendly tone" that map to broad, underspecified token distributions, increasing output variance.
Compositional Design
Breaking tasks into single-purpose prompts executed sequentially, composed programmatically. Trades latency for reliability.
Measurable Constraint
Specification that can be verified programmatically (e.g., "5 bullets," "<60 words") vs. subjective criteria ("engaging").