Robin Sloan
the lab
April 2024

At home in high-dimensional space

Twenty-sided die with faces inscribed with Greek letters, 2nd century B.C.E.–4th century C.E.

Hello from the lab!

In this edition of my technical newsletter, I want to share a few thoughts and links that have stacked up in my notes recently.


First, though, I want to be sure you know about Moonbound, my new novel coming in June.

Moonbound advance copy, aspirational shelving

This novel is more germane to the interests of this newsletter than my previous work, for a few reasons:

I’ll now encourage you to preorder Moonbound, which you can do anywhere books are sold, in any format you like, print or digital or audio. Barnes & Noble is a great option; Amazon is, of course, very convenient.

I know you understand very well the power of the algorithm: the way attention compounds. What you might not under­stand is the rela­tively modest scale of book publishing success. It only requires sales in the single-digit thousands to pop a book onto the best­seller lists, which can become gateways to further success. The point of the preorder, then, is to focus a diffuse field of interest into the hot week of a book’s release.

That’s all to say, in this domain of culture, your preorder has real consequence. Feel the power!

After you preorder, forward your confir­ma­tion email to preorder@robinsloan.com and, just before the book’s release, I’ll mail you a copy of a limited-edition zine full of world­building clues. Yes, in the real physical mail!

Which brings us to our first technical report … 

Delivering the mail in Val Town

I’ve previ­ously expressed enthusiasm about Val Town. Now, I’ve actually used it for something, and I can report that my enthu­siasm has only grown.

Val Town offers a light­weight web editor for Type­Script functions that can run in a variety of ways. For me, the killer app is the built-in email handler, which works like this:

  1. You create a “val” and designate it an email handler. That val is auto­mat­i­cally connected to an email address that looks something like this: yourUserName.coolEmailHandler@valtown.email.

  2. You write a function that accepts an Email object. This function will be called automatically when new emails arrive, and it can do anything! Maybe you want to save the message to a database. Maybe you want to parse the body and execute some command. Maybe you want to send a reply — Val Town will happily do that. (A sketch of such a function follows just after this list.)

  3. You start sending emails to your val!
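
To make that concrete, here is a minimal sketch of what such a handler val might look like in TypeScript. The shape of the incoming message and the std/email import are reconstructed from memory; treat the field names as assumptions and check Val Town’s docs for the real types.

    // Minimal sketch of a Val Town email-handler val (field names are assumptions).
    import { email } from "https://esm.town/v/std/email"; // Val Town's built-in sender

    interface IncomingEmail {
      from: string;
      subject: string;
      text?: string; // plain-text body, if present
    }

    export default async function (msg: IncomingEmail) {
      const body = msg.text ?? "";

      // Do anything here: save to a database, parse a command, call an API ...
      console.log(`Mail from ${msg.from}: "${msg.subject}" (${body.length} chars)`);

      // ... or send yourself a note back (std/email mails the val's owner).
      await email({
        subject: `Received: ${msg.subject}`,
        text: `Logged ${body.length} characters from ${msg.from}.`,
      });
    }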

I think a partic­u­larly powerful option here is to pass the body over to a language model, along with a prompt explaining what kind of infor­ma­tion you’d like to extract. In this way, an email handler val can become a bridge between the chaotic, unstruc­tured world of email and whatever more formal, schematic require­ments you might have.
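
As a sketch of that bridge (the prompt, the JSON fields, the model name, and the OpenAI SDK usage are all illustrative assumptions here, not a description of my actual val), the extraction step might look something like this:

    // Hypothetical extraction step: hand an email body to a language model and
    // ask for structured JSON back. Everything specific here is an assumption.
    import OpenAI from "npm:openai";

    const openai = new OpenAI({ apiKey: Deno.env.get("OPENAI_API_KEY") });

    async function extractOrderInfo(body: string) {
      const response = await openai.chat.completions.create({
        model: "gpt-4o-mini", // any capable model would do
        response_format: { type: "json_object" },
        messages: [
          {
            role: "system",
            content:
              "You read forwarded bookstore confirmation emails. " +
              'Reply with JSON of the form { "retailer": string, "format": string }.',
          },
          { role: "user", content: body },
        ],
      });
      // The model's reply is the formal, schematic record of a messy email.
      return JSON.parse(response.choices[0].message.content ?? "{}");
    }

    // In the handler sketched above: const info = await extractOrderInfo(body);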

Indeed, this is part of what’s happening behind the scenes with my preorder registration, described above. I wanted to avoid even the minimal friction of “please fill out this Google Form”, and since an online order is almost always repre­sented by a confir­ma­tion email, I wondered, “What if people could just forward those emails along? What if I … didn’t have to read them all myself?”

Getting that flow up and running as a val was easy and fun.

This is not a huge deal; honestly, a Google Form would have been fine. But, I am an incor­ri­gible sender of emails-to-self; it’s how I log and manage most of my notes, including many of the items that you’ll find in this newsletter. So, this expe­ri­ence has got me thinking: how might I enrich that flow, adding some structure along the way? What new flavors of emails-to-self might I conceive, with what kinds of useful “side effects”?

This is all to say, Val Town has fired up my imagination — always a good sign. The company and userbase alike are small enough to feel convivial and responsive; it’s a cool platform at a cool time.


Robin Rendle feels power­fully the romance, the energy, the POTENTIAL of modern CSS, and he expresses all that in his new blog and newsletter, The Cascade. I was a devoted reader of Robin’s contri­bu­tions to CSS Tricks, so I am all in for this new stream of writing and discovery.


Meta proclaims that its Llama 3 language model was “pretrained on 15T tokens from publicly available sources”.

Fifteen trillion tokens! Call that eleven trillion English words. If, just for fun, we say that 100,000 words equals a book, that’s equiv­a­lent to a hundred million books — only slightly less than the total number (estimated by Google Books) to have been published, ever, since the invention of the printing press.
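
(For the curious, the back-of-envelope arithmetic, with the words-per-token ratio as a rough assumption:)

    // Back-of-envelope only; ~0.75 words per token is a rough assumption.
    const tokens = 15e12;          // 15 trillion tokens
    const words = tokens * 0.75;   // ≈ 11 trillion English words
    const books = words / 100_000; // ≈ 110 million "books"
    console.log(words, books);     // vs. roughly 130 million ever published, per Google Books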

I have to confess, it remains strange to me that the AI folks worried, for so many years, about the composition of their training corpora — not only their legal status but also their structure, their origins and omissions … 

 … and then, starting sometime around 2021 or 2022, they simply: didn’t.

I think that’s because beggars can’t be choosers, and every engineer of a large model is presently, where data is concerned, a desperate beggar indeed. The require­ment for SO MUCH DATA probably reveals something deeply weird, and far from ideal, about the current approach to modeling and training these systems. Yet the connec­tion is clear: they do powerful things when you dump more data into them, so, ideal or not: more data must be had!

A recent edition of Jack Clark’s indis­pens­able AI newsletter discussed the produc­tion, to this end, of synthetic data — i.e., data not gathered from the “real world”, but produced expressly for AI training. Sometimes it’s produced using straight­for­ward computer code; sometimes it’s produced by another AI model 😵‍💫

It’s odd to contemplate these vast new “books” — training corpora — that will never be read (could never be read) by any human. They can be understood and evaluated only statistically, through spot-checking or automated analysis.

I’ll never stop saying: it’s tragic that Borges missed all this. He would have loved it.


The digital essay Models All the Way Down by Christo Buschek and Jer Thorp takes a beautiful swing at these complexities. I like the bit that Robin Rendle picked out:

Here we find an important truth about [this dataset]:

It contains less about how humans see the world than it does about how search engines see the world. It is a dataset that is power­fully shaped by commercial logics.

A different activity altogether

When you raise questions about AI training data — anything related to copyright, fair use, attribution, etc. — you’ll often encounter a defense that goes something like this:

What’s the big deal? Robin Sloan “trained” himself on a ton of copy­righted books, didn’t he? He learned from them, then went on to write books of his own, and nobody calls that copyright infringement!

This might be a reason­able argument if AI models operated at the speed and fidelity of human writers and artists. It’s true, Robin Sloan did read a ton of copyrighted books. However, he did not read all the copy­righted books, and even then, the task took him decades. Furthermore, he generates output at the rate of approximately one book every four years, which works out to approx­i­mately one token per hour 😉

When capa­bility increases so substantially, the activity under discus­sion is not “the same thing, only faster”. It is a different activity altogether. We can look to the physics of phase change for clues here.

Basically, I want to immunize you against this analogy, and this objection. There’s plenty to debate in this nascent field, but any compar­ison between AI training and human education is just laughably wrong.

This is what digi­ti­za­tion does, again and again: by removing friction, by collapsing time and space, it under­mines our intu­itions about produc­tion and exchange.

No human ever metab­o­lized infor­ma­tion as completely as these language models. “As our case is new, so we must think anew, and act anew.” You’re gonna need fresh intu­itions.

Black box science

What’s the value of these ever-growing AI models, really? I know several people working in this domain who believe the goal is one (1) thing of over­riding consequence, which we might call virtual Wolfgang Pauli, or maybe on-demand Albert Einstein: an AI model that can actually produce path­breaking scientific theory.

In this formulation, economic and social trans­for­ma­tion would be a second-order effect: not of “AI itself”, but of tech­nology derived from science it produces.

I think this vision is weirdly more plausible than AI as “general labor replacement”. I suppose you could counter by saying, if they can engineer a virtual Pauli, they can FOR SURE engineer a virtual employee; one task is strictly “easier” than the other. But I don’t know if that’s true. Was Albert Einstein a good employee?

I think about this scenario a lot. For me, it’s more imag­i­na­tively compelling, with fewer readymade sci-fi precedents, than “AI leisureworld”.

Reading histories of physics in the early 20th century, it’s thrilling to learn about the intel­lec­tual and social ferment, the rich network of “whys” behind every step forward. In an imaginary AI-powered annus mirabilis, those “whys” might be absent. No story; no development; just theory. But also, perhaps, testable predictions, and new expla­na­tions for weird phenomena, as conse­quen­tial as Einstein’s for the preces­sion of Mercury.

But science is a social process; the AI folks under­stand this very well. How would AI-generated “raw theory” be channeled into the real world of science and tech­nology? How would you know when your virtual Pauli had a theory worth testing? What if it spat out a million theories, and you had good reason to believe one of them was correct — a real paradigm-buster — but you didn’t know which?

I come down on the side of skepticism, but/and … it’s chewy stuff! Fun to think about.


My memory of the terrific TV series Halt and Catch Fire has grown a bit murky, which means it’s almost time for a rewatch; that would make my third viewing. If you’re a technical or technical-adjacent person, you really ought to make time for this one. Don’t let the melo­dra­matic opening episodes throw you — the series quickly matures into one of the best-ever tele­vi­sual produc­tions about collab­o­ra­tion and creation.

Faience polyhedron inscribed with Greek letters, 2nd-3rd century C.E.

I encourage you to take just a moment to contem­plate this twenty-sided die, and the one depicted at the top of this edition, too. These were warmed by palms nearly two thousand years ago. You could play D&D with them today, if the Met would let you.

Durable delights. Computer programs can’t do this — not yet.

From the lab in Berkeley,

Robin

P.S. This lab newsletter is very occasional, so I don’t know when I’ll send the next one. It probably won’t be until after June, so please do preorder Moonbound, and forward your confir­ma­tion email along to preorder@robinsloan.com, where my val will obedi­ently process it. Thanks for your support!
