Embeddings

In my last post I walked through five layers of token optimization that took a product classifier from $200+/month down to $25–40/month: context compression, two-stage prompting, exact-match lookup, similarity caching, and batching. Each layer attacked either the size of the context or the number of LLM calls. The post ended with the bill mostly tamed. But there was a layer I hadn’t built yet, and it’s the one that interests me most in hindsight, because it inverts the whole relationship: instead of making the LLM cheaper, you train a model to not need the LLM at all for most of the work, using the LLM’s own past output as training data. ...