Software speed and the chat illusion
It’s interesting to try to build things that are
- powered, in part, by LLMs, and
- not chatbots.
The experience has helped me realize that the chat context is, among many other things, a handy illusion that conceals a great weakness of current LLMs: their slowness.
If you compare an LLM to a human at a keyboard, the speed of its output is blazing; if you compare it to any other computer program, it is glacial.
The genius of the chat format is that it
- gives you a “cover story” for streaming your response, rather than waiting to show it all at once, and
- suggests the comparison to a human correspondent, rather than a computer program.
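The "cover story" above can be made concrete. Here is a minimal sketch, with an assumed stand-in generator in place of a real model, showing why streaming feels fast: the time until the user sees *anything* is one token's latency, while showing the whole response at once costs the full generation time.

```python
import time

def generate_tokens(n_tokens: int, seconds_per_token: float):
    """Stand-in for an LLM: yields one token at a time, each after a delay."""
    for i in range(n_tokens):
        time.sleep(seconds_per_token)
        yield f"tok{i}"

def time_to_first_output(stream) -> float:
    """Seconds until the caller can show *something* on screen (streaming UI)."""
    start = time.perf_counter()
    next(stream)  # the first token arrives; the chat UI starts painting
    return time.perf_counter() - start

def time_to_full_output(stream) -> float:
    """Seconds until the complete response is ready (all-at-once UI)."""
    start = time.perf_counter()
    for _ in stream:
        pass
    return time.perf_counter() - start

# 20 tokens at 20 ms each: a short generation, for illustration.
streamed = time_to_first_output(generate_tokens(20, 0.02))
blocking = time_to_full_output(generate_tokens(20, 0.02))
```

With these made-up numbers, `streamed` is roughly one token delay and `blocking` is roughly twenty of them; the chat interface sells the first number while the second is what you actually pay.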
But, when you want an LLM to do something other than chat, the illusion falls away, and the slowness is laid bare.
I’ve been using the API for Gemini 2.5 Flash, which is, honestly, an incredible tool: capable, inexpensive, and, compared to its peers, VERY fast. Yet even Flash’s 3-4 second responses are too slow for interactive applications. I shouldn’t say “too slow”; rather, just that they impose a serious tax, a return to the sort of PROCESSING … PROCESSING … that is, these days, less familiar and, perhaps, less tolerable.
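To put a number on that "serious tax": a simple timing wrapper makes the gap between a 3-4 second response and an interactive feel obvious. This is a sketch, not Gemini's actual API; the 200 ms "interactive budget" and the `fake_flash_call` stand-in are both assumptions for illustration.

```python
import time
from typing import Callable, Tuple

# Assumed threshold for "feels interactive" in a UI; a rule of thumb, not a standard.
INTERACTIVE_BUDGET_S = 0.2

def timed_call(fn: Callable[[], str]) -> Tuple[str, float, bool]:
    """Run a call, returning (result, elapsed seconds, within interactive budget?)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed <= INTERACTIVE_BUDGET_S

def fake_flash_call() -> str:
    """Hypothetical stand-in for a model API round trip (scaled down from 3-4 s)."""
    time.sleep(0.35)
    return "response"

text, elapsed, interactive = timed_call(fake_flash_call)
```

Even scaled down by an order of magnitude, the simulated round trip blows the budget; a real 3-4 second call misses it by a factor of fifteen or more.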
Everything is slow before it’s fast. I’m only mentioning this experience because it has clarified, for me, why the chat context is so ubiquitous for LLMs: it succeeds, in part, because it obscures the fact that the state of the art is still: slow.