Most of the attention in AI still goes to the giants — the frontier models with parameter counts that read like phone numbers. But spend time around teams actually shipping products, and you notice a quieter trend pulling in the opposite direction. The models are getting smaller, and that turns out to matter more than it sounds.
A small language model that runs directly on a phone, a laptop, or a piece of factory equipment changes the economics of building with AI. There is no per-token bill that scales with success. There is no round trip to a distant data centre, so the network delay, often hundreds of milliseconds, disappears and only the on-device inference time remains. And because the data never leaves the device, an entire category of privacy and compliance headaches largely goes away. For regulated industries (healthcare, finance, anything touching personal data), that last point alone can be enough to justify the shift.
The trade-off, of course, is capability. A compact model will not reason through a gnarly novel problem the way a frontier system can. But here is the thing most teams discover: they were never asking the frontier question in the first place. Classifying a support ticket, summarising a meeting, extracting fields from a form — these are bread-and-butter tasks, and a well-tuned small model handles them happily.
The smart architecture emerging now is tiered. Route the routine work to a local model, and escalate only the genuinely hard cases to a larger one in the cloud. It is the same instinct good engineers have always had: use the cheapest tool that does the job, and save the heavy machinery for when it earns its keep.
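To make the tiered idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `TieredRouter` class, the 0.8 confidence threshold, and the two toy stand-in models are hypothetical, and a real system would need its own way of deciding when the local model is out of its depth.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical interfaces: the local model returns (answer, confidence in [0, 1]);
# the cloud model returns only an answer.
LocalModel = Callable[[str], Tuple[str, float]]
CloudModel = Callable[[str], str]


@dataclass
class TieredRouter:
    local: LocalModel
    cloud: CloudModel
    threshold: float = 0.8  # assumed cut-off; tune it on real traffic

    def handle(self, prompt: str) -> Tuple[str, str]:
        """Return (answer, tier), where tier is 'local' or 'cloud'."""
        answer, confidence = self.local(prompt)
        if confidence >= self.threshold:
            return answer, "local"
        # Escalate only when the local model is not confident enough.
        return self.cloud(prompt), "cloud"


# Toy stand-ins so the sketch runs without any real model.
def toy_local(prompt: str) -> Tuple[str, float]:
    # Pretend the small model is confident only on short, routine requests.
    confidence = 0.95 if len(prompt.split()) <= 8 else 0.40
    return f"[local] {prompt}", confidence


def toy_cloud(prompt: str) -> str:
    return f"[cloud] {prompt}"


router = TieredRouter(local=toy_local, cloud=toy_cloud)
```

Calling `router.handle("Classify this support ticket")` stays on the local tier, while a long, open-ended prompt falls below the threshold and is escalated. Using prompt length as a confidence signal is purely a placeholder; the point is the shape of the decision, not the heuristic.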
