This is a post from Robin Sloan’s lab blog & notebook.

Software speed and the chat illusion

September 8, 2025

It’s interesting to try to build things that are

  1. powered, in part, by LLMs, and
  2. not chatbots.

The experience has helped me realize that the chat context is, among many other things, a handy illusion that conceals a great weakness of current LLMs: their slowness.

If you compare an LLM to a human at a keyboard, the speed of its output is blazing — unthinkable. If you compare it to … any other computer program … it’s glacial. Suddenly we are on dialup again.

The genius — perhaps accidental — of the chat context is that it

  1. gives you a “cover story” for streaming your response, rather than waiting to show it all at once, and
  2. suggests the comparison to a human correspondent, rather than a computer program.

But, when you want an LLM to do something other than chat — say, look at an image, make some judgments, respond with a JSON object — both of those tricks are nullified, and all you can do is wait for a complete, computer-y response.
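The gap between the two modes can be sketched with a toy simulation. Everything here is invented for illustration — a stand-in generator plays the model, and the per-token delay is arbitrary, not a measurement of any real API. The point is only structural: a streamed chat reply is "visible" at the first token, while a structured response (like a JSON object) is unusable until the last one arrives.

```python
import time

def fake_model(tokens, per_token_delay=0.01):
    """Stand-in for an LLM: yields tokens one at a time with a fixed delay.
    Both the token stream and the delay are invented for this sketch."""
    for tok in tokens:
        time.sleep(per_token_delay)
        yield tok

def first_token_latency(stream):
    """Chat-style streaming: the user sees something at the first token."""
    start = time.perf_counter()
    first = next(stream)
    return first, time.perf_counter() - start

def full_response_latency(stream):
    """Structured use (e.g. JSON): nothing is usable until the stream ends."""
    start = time.perf_counter()
    text = "".join(stream)
    return text, time.perf_counter() - start

# 100 tokens of a pretend JSON response
tokens = ["{", '"verdict":', ' "ok"', "}"] * 25

_, t_first = first_token_latency(fake_model(tokens))
_, t_full = full_response_latency(fake_model(tokens))

print(f"time to first token:  {t_first:.2f}s")
print(f"time to full response: {t_full:.2f}s")
```

With these made-up numbers, the streamed version feels responsive after a hundredth of a second, while the "complete response" version makes you sit through the whole second of generation — which is the tax the chat context hides.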

I’ve been using the API for Gemini 2.5 Flash, which is, honestly, an incredible tool: capable, inexpensive, and, compared to its peers, VERY fast. Yet even Flash’s 3-4 second responses are too slow for interactive applications. I shouldn’t say “too slow”; rather, just that they impose a serious tax, a return to the sort of PROCESSING … PROCESSING … that is, these days, less familiar and, perhaps, less tolerable.

Everything is slow before it’s fast. I’m only mentioning this experience because it has clarified, for me, the ubiquity of the chat context for LLMs, which is successful because it obscures the fact that the state of the art is still: slow.
