This is a post from Robin Sloan’s lab blog & notebook. You can visit the blog’s homepage, or learn more about me.

What's the smallest possible LLM?

June 9, 2025

I enjoyed lis­tening to a recent talk by Scott Aaronson with the irre­sistible title How Much Math Is Knowable?, and it got me thinking about the space-time tradeoff in com­puter pro­gramming.

This is just the thing in which, if you want a func­tion that, say, pro­vides the square of x for the integer inputs 1-10, you can imple­ment it either as

  def square(x)
    return x*x
  end

or as

  SQUARES = [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

  def square(x)
    return SQUARES[x]
  end

Your instinct is prob­ably to look at the second ver­sion and think “that’s stupid” but, hold up: maybe you’re working on a com­puter with odd constraints! Maybe your memory is lightning-fast and your processor runs at two hertz. My example is extreme, but plenty of video game pro­gram­mers chose lookup tables over “doing the math”, espe­cially in the early days.

So that’s space vs. time. Every com­pres­sion algo­rithm dances along this axis: store less data, but do some work — often significant — to recon­sti­tute the orig­inal signal.

This got me thinking about LLMs, and the limits of the tradeoff.

We know that the largest LLM for a given corpus would be a lookup table, simply recording the cor­rect next char­acter for every pos­sible input. With a vocab size of 65,000 and a con­text window of 200,000 tokens — roughly Claude 3.7 specs — that would require merely 65,000 ^ 200,000 * 16 bits! Easy! (Definitely feels like one of those “larger than the number of par­ti­cles in the universe” numbers.) (It occurs to me that I really ought to know the par­ti­cles-in-the-universe expo­nent off the top of my head … )

So what’s the the­o­ret­i­cally smallest pos­sible LLM for a given corpus, vocab, con­text window, and loss? Imagine a Master Gar­dener, just as abstract as my super lookup table above, who knows how to grow the most compact, effi­cient pro­gram pos­sible on a trellis of code. How many bits would that ver­sion require? I don’t think the AI com­mu­nity has a way to for­malize this, presently.

Really my ques­tion trans­lates to “what’s the Kol­mogorov com­plexity of the corpus?” and, likewise, humanity can’t artic­u­late this value for interestingly-complex arti­facts and processes. I wish we could! I’d like to know the Kol­mogorov com­plexity of the world economy. Of Earth’s ecosystem!

I’m curious even about orders of magnitude. I’ve lost track of how big the fron­tier models are on disk … 500 gigabytes? Do researchers at those labs imagine you could fit a “perfect” Claude 3.7 into … half that space? The “time” in the space-time tradeoff has kicked in for these models recently, with the inference-time “reasoning” revolution; how far could you push it, given, literally, all the time in the world?

There must be limits here — right? Could our imag­i­nary Master Gar­dener cram Claude 3.7 into 10 gigabytes? Surely not. But why not?

Fun to think about. I wel­come any pointers to research and/or per­sonal musings — email’s at the top, as always.

Update: There has indeed been big new work on the space-time tradeoff!

With his new simulation, Williams had proved a pos­i­tive result about the com­pu­ta­tional power of space: Algo­rithms that use rel­a­tively little space can solve all prob­lems that require a some­what larger amount of time. Then, using just a few lines of math, he flipped that around and proved a neg­a­tive result about the com­pu­ta­tional power of time: At least a few prob­lems can’t be solved unless you use more time than space.

Thanks to Frank Lantz for the link.

To the blog home page