Use newer AI models to instruct older models
Every few months a new frontier AI model arrives, and the pattern repeats: it is noticeably better at judgment, and noticeably more expensive than the models before it. As of June 2026, Anthropic’s newest Claude model sits at the top of the coding benchmarks, OpenAI’s latest GPT generation powers Codex, and Google’s Gemini Pro models compete hard on price. The previous generation — Opus, Sonnet, older GPT models — is still available, still capable, and much cheaper.
Most people treat this as a choice: pay for the best, or save with the rest. I think that framing is wrong. The setup that actually works is to use them together, with a clear division of labor: the newest model writes the instructions, the older models follow them.
The cost asymmetry, and what it buys
The price gap between a frontier model and the workhorse tier is large enough that running everything on the newest model rarely makes sense for routine work. But the capability gap is not evenly distributed. Older models are close to the frontier on well-defined tasks: write this function, translate this page, follow this checklist. The gap shows up in judgment — handling ambiguity, noticing what is missing, knowing when a rule should not apply.
That asymmetry points to the division of labor. Judgment is needed when decisions are made. Execution mostly needs decisions that have already been made, written down clearly.
So make the decisions once, with the strongest model available, and write them down in a form the cheaper models can follow. This is the same logic as a senior developer writing coding guidelines for a team. You do not need the senior developer in every code review if the guidelines are good enough — you need them when the guidelines run out.
What this looks like in practice
The pattern has four steps:
- Work through the strategy with the newest model. Not “write me an article”, but the underlying decisions: who the audience is, what the rules are, where the edge cases sit, what failure looks like. The frontier model is at its best here, because this is where ambiguity lives.
- Have it write durable instructions. Project instruction files such as AGENTS.md or CLAUDE.md, prompts, checklists, style rules, do/don’t examples. The output is not work; it is the playbook for work.
- Let cheaper models execute inside those rails. Claude Sonnet, Codex, Gemini, or whatever fits your tooling. Day-to-day tasks, drafts, translations, refactors, content updates: all running on instructions that already contain the judgment.
- Escalate when the rails end. When something is genuinely new, conflicts with an existing rule, or feels strategically important, go back to the frontier model and update the playbook.
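Steps 3 and 4 come down to a routing decision, which can be sketched in a few lines of Python. This is a minimal illustration, not a description of any tool's API: the `Task` fields, the `Tier` names, and the `route` function are hypothetical, and in practice a person or the executing model itself would have to set the three flags.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    EXECUTOR = "executor"  # cheaper model working inside the playbook
    FRONTIER = "frontier"  # newest model, used to update the playbook


@dataclass
class Task:
    """Hypothetical task record; the flags mirror the escalation triggers above."""
    description: str
    covered_by_playbook: bool = True      # an existing rule applies to the task
    conflicts_with_rule: bool = False     # the task contradicts an existing rule
    strategically_important: bool = False


def route(task: Task) -> Tier:
    """Pick the tier for a task: execute inside the rails, or escalate."""
    needs_judgment = (
        not task.covered_by_playbook
        or task.conflicts_with_rule
        or task.strategically_important
    )
    return Tier.FRONTIER if needs_judgment else Tier.EXECUTOR


routine = Task("Translate the pricing page")
novel = Task("Launch a third language version", covered_by_playbook=False)
print(route(routine).value, route(novel).value)
```

The default is deliberately the cheap path: a task goes to the frontier model only when one of the escalation triggers fires.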
If you are new to instruction files and AI coding tools in general, the basics are covered in how to get started programming with AI.
A concrete example
I run two language versions of this website, each targeting a different market. The strategy questions — which market each site serves, how to prevent the two sites’ content from drifting, how to avoid keyword cannibalization between similar articles — were worked through with the newest Claude model, which then wrote the rules into the repository’s AGENTS.md: target audiences, phrasing examples with good and bad versions, an SEO section, and a checklist for new articles.
The point of that work is not that the frontier model wrote nice documentation. It is that a cheaper model can now produce a publish-ready draft on the first attempt, because the instructions already contain the decisions it would otherwise guess at. The expensive model ran once. The cheap models run every day.
What good instructions look like
Writing instructions for a weaker model is a skill, and the frontier models are genuinely good at it when you ask directly. A few properties matter more than length:
- Concrete examples beat abstract rules. “Write in a calm, direct voice” is decoration. A good sentence and a bad sentence, side by side, with one line explaining the difference, changes behavior.
- Checklists beat prose. A numbered list of what a finished task includes is something a small model can check its own output against.
- Required self-checks catch errors early. Rules like “state your conclusion about overlap before writing” force the model to show its reasoning where you can see it.
- Escalation rules prevent silent failure. The instruction “if the task does not fit these rules, stop and say so” is the cheapest insurance available.
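To make the four properties concrete, here is a rough Python sketch that assembles an instruction section containing all of them. The function name, the section headings, and the sample good/bad sentences are my own placeholders for illustration; the self-check and escalation wording are the ones quoted above. A real AGENTS.md would be written and reviewed by hand or by the frontier model, not generated like this.

```python
def render_instructions(good: str, bad: str, why: str,
                        checklist: list[str], self_check: str,
                        escalation: str) -> str:
    """Build an instruction section with an example pair, a numbered
    checklist, a required self-check, and an escalation rule."""
    lines = [
        "## Voice",
        f"Good: {good}",
        f"Bad: {bad}",
        f"Why: {why}",
        "",
        "## Definition of done",
    ]
    # Numbered items give the executing model something to tick off.
    lines += [f"{i}. {item}" for i, item in enumerate(checklist, start=1)]
    lines += [
        "",
        "## Before you write",
        self_check,
        "",
        "## When the rules do not fit",
        escalation,
    ]
    return "\n".join(lines) + "\n"


text = render_instructions(
    good="The older model follows the checklist.",          # placeholder example
    bad="Leverage synergies across your model portfolio.",  # placeholder example
    why="The good sentence names who does what.",
    checklist=["Target audience stated", "SEO section checked"],
    self_check="State your conclusion about overlap before writing.",
    escalation="If the task does not fit these rules, stop and say so.",
)
print(text)
```

Note that length is not a parameter anywhere. Each argument maps to one of the properties above, which makes a missing one easy to spot.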
The limits
This is not a trick that removes the need for review. Instructions age: facts change, strategies change, and a playbook written in January quietly drifts out of date by summer. Time-sensitive claims — model capabilities, prices, SEO guidance — should stay with the model that can verify them, not be frozen into instructions. And no instruction file makes a small model good at judgment; it only reduces how often judgment is needed.
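One way to keep aging visible is to record when each rule was last reviewed and flag the old ones. The sketch below assumes you track review dates yourself, for example in a small table next to the instruction file. The `stale_rules` function, the rule names, and the 90-day threshold are all assumptions for illustration, not a recommendation.

```python
from datetime import date, timedelta


def stale_rules(last_reviewed: dict[str, date], today: date,
                max_age_days: int = 90) -> list[str]:
    """Return the names of rules last reviewed more than max_age_days ago."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, reviewed in last_reviewed.items()
                  if reviewed < cutoff)


# Hypothetical review log: one rule untouched since January, one reviewed in June.
reviews = {
    "seo-guidance": date(2026, 1, 15),
    "voice-examples": date(2026, 6, 1),
}
print(stale_rules(reviews, today=date(2026, 7, 1)))  # ['seo-guidance']
```

A flagged rule is a prompt to go back to the frontier model, not proof that the rule is wrong.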
The comparison between specific tools matters less than this structure. Whether your daily driver is ChatGPT or Claude — a question I covered in ChatGPT vs Claude for developers — the leverage is the same: the newest model is most valuable not as a faster worker, but as the author of the system your cheaper workers run on.
If you want help setting this up for a real project — instruction files, content rules, and a workflow your tools actually follow — get in touch.