
Is it okay?

February 11, 2025
Macbeth Consulting the Witches, 1825, Eugène Delacroix

How do you make a language model? Goes like this: erect a trellis of code, then allow the real program to grow, its development guided by a grueling training process, fueled by reams of text, mostly scraped from the internet. Now. I want to take a moment to think together about a question with no remaining practical importance, but persistent moral urgency:

Is that okay?

The question doesn’t have any practical importance because the AI companies — and not only the companies, but the enthusiasts, all over the world — are going to keep doing what they’re doing, no matter what.

The question does still have moral urgency because, at its heart, it’s a question about the things people all share together: the hows and the whys of humanity’s common inheritance. There’s hardly anything bigger.

And, even if the companies and the enthusiasts rampage ahead, there are still plenty of us who have to make personal decisions about this stuff every day. You gotta take care of your own soul, and I’m writing this because I want to clarify mine.


A few ground rules.

First, if you (you engineer, you AI acolyte!) think the answer is obviously “yes, it’s okay”, or if you (you journalist, you media executive!) think the answer is obviously “no, it’s not okay”, then I will suggest that you are not thinking with sufficient sensitivity and imagination about something truly new on Earth. Nothing here is obvious.

Second, I’d like to proceed by depriving each side of its best weapon.

On the side of “yes, it’s okay”, I will insist that the analogy to human learning is not admissible. “Don’t people read things, and learn from them, and produce new work?” Yes, but speed and scale always influence our judgments about safety and permissibility, and the speed and scale of machine learning are off the charts. No human, no matter how well-read, could ever field requests from a million other people, all at once, forever.

On the side of “no, it’s not okay”, I will set aside any arguments grounded in copyright law. Not because they are irrelevant, but because … well, I think modern copyright is flawed, so a victory on those grounds would be thin, a bit sad. Instead, I’ll defer to deeper precedents: the intuitions and aspirations that gave rise to copyright in the first place. To promote the Progress of Science and useful Arts, remember?

I hope partisans of both sides will agree this is a fair swap. Put down your weapons, and let’s think together.


I want to go carefully, step by step — yet I want to do so with brevity. Language models produce so … many … WORDS, and they seem to coax just as many out of their critics. Logorrhea begets logorrhea. We can do better.

I’ll begin with my sense of what language models are doing. Here it is: language models collate and precipitate all the diverse reasons for writing, across a huge swath of human activity and aspiration. Count off those reasons: to inform, to persuade, to sell this stupid alarm clock, to dump the CUSTOMERS table into a CSV file … and you realize it’s a vast field of desire and action, impossible to hold in your head.

The language models have many heads.

To make this work — you already know this, but I want to underscore it — only a truly rich trove of writing suffices. Train a language model on all of Shakespeare’s works and you won’t get anything useful, just a brittle Shakespeare imitator.
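(If you want the brittleness in miniature, here is a toy sketch in Python. It is my illustration, not anyone’s actual training recipe; the corpus string and the function names are invented for the example. The point is only that a small trove can yield nothing but a small imitator.)

```python
import random
from collections import defaultdict

# Toy "training": for each character in the corpus, record which
# characters have followed it.
corpus = "tomorrow and tomorrow and tomorrow creeps in this petty pace from day to day"

followers = defaultdict(list)
for a, b in zip(corpus, corpus[1:]):
    followers[a].append(b)

def generate(seed: str = "t", length: int = 60) -> str:
    # Sample a continuation one character at a time, using only the
    # patterns present in the tiny corpus.
    out = [seed]
    for _ in range(length):
        options = followers.get(out[-1])
        if not options:
            break
        out.append(random.choice(options))
    return "".join(out)

print(generate())
# The output can only ever reshuffle "tomorrow" and "petty pace":
# a brittle imitator, because the trove is tiny.
```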

In fact, the only trove known to produce noteworthy capabilities is: the entire internet, or close enough. The whole extant commons of human writing. From here on out, for brevity, we’ll call it Everything.

This is what makes these language models new: there has never, in human history, been a way to operationalize Everything. There’s never been anything close.

Just as, above, I set copyright aside, I want also to set aside fair use and the public domain. Again, not because they are irrelevant, but because those intuitions and frameworks all assume we are talking about using some part of the commons — not all of it.

I mean: ALL of it!

If language models worked like cartoon villains, slurping up Everything and tainting it with techno-ooze, our judgment would be easy. But of course, digitization is trickier than that: the airy touch of the copy complicates the scenario.

The language model reads Everything, and leaves Everything unchanged — yet suddenly this new thing exists, with strange and formidable powers.

Is that okay?


As we begin to feel our way across truly new terrain, we can inquire: how much of the value of these models comes from Everything? If the fraction were just one percent, or even ten, then we wouldn’t have much more to say.

But the fraction is, for sure, larger than that.

What goes into a language model? Data and compute.

For the frontier models like Claude, data means: Everything.

Compute combines two pursuits:

  1. software: the trellises and applications that support the development and deployment of these models, and

  2. hardware: the vast, sultry data centers, stocked with chips, that give them room to run

There’s a lot of value in those pursuits; I don’t take either for granted, or the labor they require. The expe­ri­ence you get using a model like Claude depends on an inge­nious scaffolding. Truly! At the same time: I believe anyone who works on these models has to con­cede that the trel­lises and the chips, without data, are empty vessels. Inert.

Rea­son­able people can dis­agree about how the value breaks down. While I believe the rel­a­tive value of Every­thing in this mix is some­thing close to 90%, I’m willing to con­cede a 50/50 split.

And here is the impor­tant thing: there is no substitute.

You’ve prob­ably heard about the race to gen­erate novel training data, and all the inter­esting effects such data can have. It is some­times lost in those dis­cus­sions that these sophis­ti­cated new cur­ricula can only be pro­vided to a lan­guage model already trained on Every­thing. That training is what allows it to make sense of the new material.

Also, it is often the case — not always, but often — that the novel training data is gen­erated by … a lan­guage model … which has itself been trained on … you guessed it.

It’s Every­thing, all the way down.

Would it be possible to commission a fresh body of work, Everything’s equal in scale and diversity, without any of the encumbrances of the commons? If you could do it, and you trained a clean-room model on that writing alone, I concede that my question would be moot. (There would be other questions! Just not this one.) Certainly, with as much money as the AI companies have now, you’d expect they might try. We know they are already paying to produce new content, lots of it, across all sorts of business and technical domains.

But this still wouldn’t match the depth and richness of Everything. I have a hypothesis, which naturally might be wrong: that it is precisely the naivete of Everything, the fact that its writing was actually produced for all those different reasons, that makes it so valuable. Composing a fake corporate email, knowing it will be used to train a language model, you’re not doing nothing, but you’re not doing the same thing as the real email-writer. Your document doesn’t have the same … what? The same grain. The same umami.

Maybe one of these companies will spend ten billion dollars to commission a whole new internet’s worth of text and prove me wrong. However, I think there are information-theoretic reasons to believe the results of such a project would disappoint them.


So! Understanding that these models are reliant on Everything, and derive a large fraction of their value from it, one judgment becomes clear:

If their primary application is to produce writing and other media that crowds out human composition, human production: no, it’s not okay.

For me, this is intuitively, almost viscerally, obvious. Here is the ultimate act of pulling the ladder up behind you, a giant “fuck you” to every human who ever wanted to accomplish anything, who matched desire to action, in writing, part of Everything. Here is a technology founded in the commons, working to undermine it. Immanuel Kant would like a word.

Fine. But what if that isn’t the primary application? What if language models, by collating and precipitating all the diverse reasons for writing, become flexible general-purpose reasoners, and most of their “output” is never actually read by anyone, instead running silent like the electricity in your walls?

It’s possible that language models could go on broadening and deepening in this way, and eventually become valuable aids to science and technology, to medicine and more.

This is tricky — it’s so, so tricky — because the claim is both (1) true, and (2) convenient. One wishes it weren’t so convenient. Can’t these companies simply promise, with every passing year, that AI super science is just around the corner … and meanwhile, wreck every creative industry, flood the internet with garbage, grow rich on the value of Everything? Let us cook — while culture fades into a sort of oatmeal sludge.

They can do that! They probably will. And the claim might still be true.

If super science is a possibility — if, say, Claude 13 can help deliver cures to a host of diseases — then, you know what? Yes, it is okay, all of it. I’m not sure what kind of person could insist that the maintenance of a media status quo trumps the eradication of, say, most cancers. Couldn’t be me. Fine, wreck the arts as we know them. We’ll invent new ones.

(I know that seems awfully consequentialist. Would I sacrifice anything, or everything, for super science? No. But art and media can find new forms. That’s what they do.)

Obviously, this scenario is especially appealing if the super science, like Everything at its foundation, flows out into the commons. It should.

So — is super science really on the menu? We don’t have any way of knowing; not yet. Things will be clearer in a few years, I think. There will either be real undeniable glimmers, reported by scientists putting language models to work, or there will still only be visions.

For my part, I think the chance of super science is below fifty percent, owing mostly to the friction of the real physical world, which the language models have, so far, avoided. But, I also think the chance is above ten percent, so, I remain curious.

It’s not unreasonable to find this wager suspicious, but if you do, I might ask: is there any possible-but-unproven technology that you think is worth pursuing even at the cost of itchy uncertainty in the present? If the answer is “yes, just not this one”: fair enough. If the answer is “no”: aha! I see you’ve answered the question at the top of this page for yourself already.


Where does this leave us?

I suppose it’s not sur­prising, in the end:

If an AI appli­ca­tion delivers some pro­found public good, or even if it might, it’s prob­ably okay that its value is rooted in this unprece­dented oper­a­tional­iza­tion of the com­mons.

If an AI appli­ca­tion simply repli­cates Every­thing, it’s prob­ably not okay.

I’ll sketch out my cur­rent opin­ions more specifically:

I think the image gen­er­a­tion models, trained on the Every­thing of pictures, are: prob­ably not okay. They don’t do any­thing except make more images. They pee in the pool.

I think the fron­tier models like Claude are: prob­ably okay. If it seemed, a couple of years ago, that they were going to be used mainly to barf out text, that impres­sion has faded. It’s clear their appli­ca­tions are diverse, and often have more to do with processes than end products.

The case of trans­la­tion is compelling. If lan­guage models are, indeed, the Babel fish, they might jus­tify the oper­a­tional­iza­tion of the com­mons even without super sci­ence.

I think the case of code is espe­cially clear, and, for me, basi­cally settled. That’s both (1) because of where code sits in the cre­ative process, as an inter­me­diate product, the thing that makes the thing, and (2) because the com­mons of open-source code has car­ried the expec­ta­tion of rich and sur­prising reuse for decades. I think this appli­ca­tion has, in fact, already passed the threshold of “pro­found public good”: opening up pro­gramming to whole new groups of people.

But, again, it’s impor­tant to say: the code only works because of Every­thing. Take that data away, train a model using GitHub alone, and you’ll get a far less useful tool.

Maybe (it turns out) I’m less inter­ested in lit­i­gating my foun­da­tional ques­tion and more inter­ested in simply insisting on the overwhelming, irre­place­able con­tri­bu­tion of this great cen­tral treasure: all of us, writing, for every con­ceiv­able reason; desire and action, impos­sible to hold in your head.


Did we make progress here? I think so. It’s possible my question, at the outset, seemed broad. In fact, it’s fairly narrow, about this core mechanism, the operationalization of the commons: whether I can live with it, or not.

One extreme: if these machines churn through all media, and then, in their deployment, blow away any prospect for a healthy market for human-made media, I’d say, no, that’s not what we want from technology, or from our future.

Another extreme: if these machines churn through all media, and then, in their deployment, discover several superconductors and cure all cancers, I’d say, okay … we’re good.

What if they do both? Well, it would be a bummer for media, but on balance I’d take it. There will always be ways for artists to get out ahead again. More on that in another post.

I also think there are some potential policy remedies that would even out the allocation of value here — although, these days, imagining interesting policy is a sort of fantastical entertainment. Even so, I’ll post about those later, too.

In this discussion, I set copyright and fair use aside. I should say, however, that I’m not at all interested in clearing the air for AI companies, legally. They’ve chosen to plunge ahead into new terrain — so let them enjoy the fog of war, Civ-style. Let them cook!
