What's the smallest possible LLM?
I enjoyed listening to a recent talk by Scott Aaronson with the irresistible title How Much Math Is Knowable?, and it got me thinking about the space-time tradeoff in computer programming.
This is just the thing in which, if you want a function that, say, provides the square of x for the integer inputs 1-10, you can implement it either as
def square(x)
return x*x
end
or as
SQUARES = [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
def square(x)
return SQUARES[x]
end
Your instinct is probably to look at the second version and think “that’s stupid” but, hold up: maybe you’re working on a computer with odd constraints! Maybe your memory is lightning-fast and your processor runs at two hertz. My example is extreme, but plenty of video game programmers chose lookup tables over “doing the math”, especially in the early days.
So that’s space vs. time. Every compression algorithm dances along this axis: store less data, but do some work —
This got me thinking about LLMs, and the limits of the tradeoff.
We know that the largest LLM for a given corpus would be a lookup table, simply recording the correct next character for every possible input. With a vocab size of 65,000 and a context window of 200,000 tokens —
So what’s the theoretically smallest possible LLM for a given corpus, vocab, context window, and loss? Imagine a Master Gardener, just as abstract as my super lookup table above, who knows how to grow the most compact, efficient program possible on a trellis of code. How many bits would that version require? I don’t think the AI community has a way to formalize this, presently.
Really my question translates to “what’s the Kolmogorov complexity of the corpus?” and, likewise, humanity can’t articulate this value for interestingly-complex artifacts and processes. I wish we could! I’d like to know the Kolmogorov complexity of the world economy. Of Earth’s ecosystem!
I’m curious even about orders of magnitude. I’ve lost track of how big the frontier models are on disk … 500 gigabytes? Do researchers at those labs imagine you could fit a “perfect” Claude 3.7 into … half that space? The “time” in the space-time tradeoff has kicked in for these models recently, with the inference-time “reasoning” revolution; how far could you push it, given, literally, all the time in the world?
There must be limits here —
Fun to think about. I welcome any pointers to research and/or personal musings —
Update: There has indeed been big new work on the space-time tradeoff!
With his new simulation, Williams had proved a positive result about the computational power of space: Algorithms that use relatively little space can solve all problems that require a somewhat larger amount of time. Then, using just a few lines of math, he flipped that around and proved a negative result about the computational power of time: At least a few problems can’t be solved unless you use more time than space.
Thanks to Frank Lantz for the link.
To the blog home page