Software speed and the chat illusion
It’s interesting to try to build things that are
- powered, in part, by LLMs, and
- not chatbots.
The experience has helped me realize that the chat context is, among many other things, a handy illusion that conceals a great weakness of current LLMs: their slowness.
If you compare an LLM to a human at a keyboard, the speed of its output is blazing; if you compare it to any other computer program, it is glacial.
The genius of the chat format is that it
- gives you a “cover story” for streaming your response, rather than waiting to show it all at once, and
- suggests the comparison to a human correspondent, rather than a computer program.
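The "cover story" above can be made concrete. Here is a minimal sketch, with an assumed stand-in generator in place of a real model, showing why streaming feels fast: the time until the user sees *anything* is one token's latency, while showing the whole response at once costs the full generation time.

```python
import time

def generate_tokens(n_tokens: int, seconds_per_token: float):
    """Stand-in for an LLM: yields one token at a time, each after a delay."""
    for i in range(n_tokens):
        time.sleep(seconds_per_token)
        yield f"tok{i}"

def time_to_first_output(stream) -> float:
    """Seconds until the caller can show *something* on screen (streaming UI)."""
    start = time.perf_counter()
    next(stream)  # the first token arrives; the chat UI starts painting
    return time.perf_counter() - start

def time_to_full_output(stream) -> float:
    """Seconds until the complete response is ready (all-at-once UI)."""
    start = time.perf_counter()
    for _ in stream:
        pass
    return time.perf_counter() - start

# 20 tokens at 20 ms each: a short generation, for illustration.
streamed = time_to_first_output(generate_tokens(20, 0.02))
blocking = time_to_full_output(generate_tokens(20, 0.02))
```

With these made-up numbers, `streamed` is roughly one token delay and `blocking` is roughly twenty of them; the chat interface sells the first number while the second is what you actually pay.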
But, when you want an LLM to do something other than chat, the illusion falls away, and the slowness is laid bare.
I’ve been using the API for Gemini 2.5 Flash, which is, honestly, an incredible tool: capable, inexpensive, and, compared to its peers, VERY fast. Yet even Flash’s 3-4 second responses are too slow for interactive applications. I shouldn’t say “too slow”; rather, just that they impose a serious tax, a return to the sort of PROCESSING … PROCESSING … that is, these days, less familiar and, perhaps, less tolerable.
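To put a number on that "serious tax": a simple timing wrapper makes the gap between a 3-4 second response and an interactive feel obvious. This is a sketch, not Gemini's actual API; the 200 ms "interactive budget" and the `fake_flash_call` stand-in are both assumptions for illustration.

```python
import time
from typing import Callable, Tuple

# Assumed threshold for "feels interactive" in a UI; a rule of thumb, not a standard.
INTERACTIVE_BUDGET_S = 0.2

def timed_call(fn: Callable[[], str]) -> Tuple[str, float, bool]:
    """Run a call, returning (result, elapsed seconds, within interactive budget?)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed <= INTERACTIVE_BUDGET_S

def fake_flash_call() -> str:
    """Hypothetical stand-in for a model API round trip (scaled down from 3-4 s)."""
    time.sleep(0.35)
    return "response"

text, elapsed, interactive = timed_call(fake_flash_call)
```

Even scaled down by an order of magnitude, the simulated round trip blows the budget; a real 3-4 second call misses it by a factor of fifteen or more.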
Everything is slow before it’s fast. I’m only mentioning this experience because it has clarified, for me, why the chat context is so ubiquitous for LLMs: it succeeds, in part, because it obscures the fact that the state of the art is still: slow.