Voyages in sentence space

Well, it’s January 2020, and the day has finally come: I can no longer operate the software that made this essay interactive. I’ll leave the text up, but please understand that this essay is now considered “broken.” That’s the challenge this kind of presentation poses: it’s very cool when it works, but then — and this happens eventually, inevitably, entropically — it doesn’t.

Imagine a sentence. “I went looking for adventure.”

Imagine another one. “I never returned.”

Now imagine a sentence gradient between them — not a story, but a smooth interpolation of meaning. This is a weird thing to ask for! I’d never even bothered to imagine an interpolation between sentences before encountering the idea in a recent academic paper. But as soon as I did, I found it captivating, both for the thing itself — a sentence … gradient? — and for the larger artifact it suggested: a dense cloud of sentences, all related; a space you might navigate and explore.

Here’s what a neural network instructed to produce such a cloud of sentences (specifically, sentences from science fiction) delivers when you ask it to draw a gradient between “I went looking for adventure.” and “I never returned.”

I won't acknowledge.
I stared after midnight
I stared agitated.
I get definitely.
I never returned.

You can ask it to draw a gradient of your own! Just replace the first and last sentences and use this button:

So, does that sentence gradient make sense? I honestly don’t know. Is it useful? Probably not! But I do know it’s interesting, and the larger artifact — the dense cloud, the sentence space — feels very much like something worth exploring.

A comfortable embedding

I’ve been exploring neural networks — in particular, neural networks that generate text — for a while now. (You can find a previous experiment here.)

When you’re tinkering with these tools, trying to produce something interesting (maybe even artful) from a dataset, whether it’s composed of text or images or something else, you often find yourself embedding that data into numeric space.

At a super simple level, imagine a dataset consisting of color swatches: rusty orange, dusty magenta, deep purple. You can see why it might make sense to embed these standalone swatches into a one-dimensional number line, a smooth sweep of color — 

—so each has its own coordinate and there are also, as a significant added bonus, coordinates for all the intermediate colors between them.

Imagine a more complex dataset consisting of more colors. You can see how two dimensions might be useful:

Just like that, this dataset becomes something a neural network can chomp on, because it’s no longer color swatches described with metaphors, but a set of numbers. You know what computers love? SETS OF NUMBERS.
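
To make that concrete, here is a minimal sketch of the swatch example. The swatch names come from above, but the hue values are made up for illustration; each color gets a single coordinate on the line, and any point between two coordinates is itself a color.

    import colorsys

    # Hypothetical one-dimensional embedding: each swatch reduced to a hue in [0, 1].
    swatches = {
        "rusty orange": 0.07,
        "dusty magenta": 0.87,
        "deep purple": 0.78,
    }

    def blend(a, b, t):
        """Linearly interpolate between two coordinates on the number line."""
        return (1 - t) * a + t * b

    # Every intermediate coordinate decodes back into a color of its own.
    for i in range(5):
        t = i / 4
        hue = blend(swatches["rusty orange"], swatches["deep purple"], t)
        r, g, b = colorsys.hsv_to_rgb(hue, 0.6, 0.8)
        print(f"t={t:.2f}  hue={hue:.2f}  rgb=({r:.2f}, {g:.2f}, {b:.2f})")

A two-dimensional embedding works the same way; each swatch simply gets two coordinates instead of one.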

In practice, because datasets are often very rich — images of faces, sentences from science fiction stories — the numeric space into which you do this embedding will have not just two but dozens or hundreds of dimensions. I definitely can’t visualize a space like that — but so what? It turns out imaginary spaces are useful even if you can’t, in fact, imagine them.

Up above, embedding our color swatches into one or two dimensions was straightforward; the mapping was obvious. But how do we embed a face or a sentence into a numeric space with a hundred dimensions? How do we learn to map from “I went looking for adventure” to (-0.0036, -0.063, 0.014, … ) and back?

One tool we can use is called a variational autoencoder. It’s a kind of neural network that learns to embed rich data into numeric space, and not only embed it, but “pack” it densely. A variational autoencoder, even more than nature, abhors a vacuum.
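
The details live in the papers, but here is a rough, non-authoritative sketch of the general shape of a variational autoencoder, with stand-in functions rather than real learned layers: the encoder maps an input to the mean and variance of a small region of numeric space, a point is sampled from that region, the decoder tries to reconstruct the input from that point, and a KL penalty is what keeps the whole cloud packed densely, with no empty gaps.

    import numpy as np

    rng = np.random.default_rng(0)

    def encode(x):
        """Stand-in encoder: a real model would use learned layers.
        Returns the mean and log-variance of a region in a 16-dim latent space."""
        mu = x[:16] * 0.1
        log_var = np.full(16, -2.0)
        return mu, log_var

    def sample(mu, log_var):
        """Reparameterization trick: draw a latent point from the region."""
        eps = rng.standard_normal(mu.shape)
        return mu + np.exp(0.5 * log_var) * eps

    def kl_penalty(mu, log_var):
        """KL divergence from a standard normal prior -- the pressure that
        packs every region into one dense, gap-free cloud."""
        return -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))

    x = rng.standard_normal(64)          # a fake "input" vector for the sketch
    mu, log_var = encode(x)
    z = sample(mu, log_var)
    print("latent point:", np.round(z[:4], 3), "...")
    print("KL penalty:", round(kl_penalty(mu, log_var), 3))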

In academic papers about autoencoders (like this one) you’ll often see a diagram demonstrating how a dataset of celebrity faces has been embedded into numeric space. The paper will show smooth (and perhaps slightly unsettling) gradients between points in that space, each of which represents a unique face:

Here’s where things get interesting. In 2016, a paper called “Generating Sentences from a Continuous Space” by Samuel R. Bowman, Luke Vilnis, et al. showed that you can use a variational autoencoder to embed sentences into numeric space, and pioneered a few techniques to make it possible.

The paper also introduced, along the same lines as the unsettling celebrity gradient, the concept of a smooth homotopy, or linear interpolation, between sentences. I understood these immediately as sentence gradients and as soon as I read the paper … I had to have them.
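
Once sentences live in numeric space, the homotopy itself is just a straight line. Here is a sketch under that assumption, where encode and decode are hypothetical stand-ins for a trained model like the one the paper describes:

    import numpy as np

    def sentence_gradient(encode, decode, first, last, steps=5):
        """Linear interpolation ("homotopy") between two sentences.
        `encode` and `decode` are placeholders for a trained autoencoder."""
        z_first = np.asarray(encode(first))
        z_last = np.asarray(encode(last))
        gradient = []
        for i in range(steps):
            t = i / (steps - 1)
            z = (1 - t) * z_first + t * z_last   # a point on the straight line
            gradient.append(decode(z))
        return gradient

    # With a real model, something like:
    # sentence_gradient(model.encode, model.decode,
    #                   "I went looking for adventure.", "I never returned.")

The only modeling work is in encode and decode; the walk between the two points is ordinary arithmetic.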

Programming is hard

I tried to implement the paper myself. I failed. Even after corresponding with the authors, I just couldn’t get the basic autoencoding engine to work.

Lucky me: not even a year later, another paper appeared, extending the work of Bowman et al. “A Hybrid Convolutional Variational Autoencoder for Text Generation” by Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth offered substantial additions to the idea and, even better: it offered THE CODE!!

(Let me just take a moment to praise researchers who publish their code. Without this project from Semeniuta et al. as a starting point, I would never have been able to explore these techniques. What a gift.)

Code in hand, I was well on my way to generating sentence gradients myself. I figured out the math to move through sentence space, implemented a few features to help organize experiments, added a simple server.

But there was a persistent problem: it ran too slowly. I would write two sentences, ask the neural network to generate a gradient between them, and … wait. And wait and wait. Minutes passed. The process was too drawn out for experimentation, for exploration, for play.

Again, I tried to fix it myself. But I didn’t (and still don’t!) understand the innermost engine well enough to see how I could speed up that process of moving sentences in and out of numeric space.

That’s when I asked for help.

The programmer Richard Assar’s implementation of a paper called SampleRNN, shared on GitHub, had impressed me with its usability and its speed. Sound generated by his code made its way into the audiobook of my latest novel. So, I reached out to him and asked: could I commission you to take a look at this sentence space project?

Richard said yes, and overnight — not literally overnight, but … basically overnight — he made it go faster by an order of magnitude. Now, you only wait a beat. Sentence one, sentence two, beat … gradient.

Working with this code shared by Semeniuta et al., streamlined by Richard Assar, what did I end up with?

Welcome to sentence space

My project sentence-space, now public on GitHub, provides an API that serves up two things:

  1. Sentence gradients: smooth interpolations between two input sentences.
  2. Sentence neighborhoods: clouds of alternative sentences closely related to an input sentence.

Sentence neighborhoods are simpler than gradients. Given an input sentence, what if we imagine ourselves standing at its location in sentence space, peering around, jotting down some of the other sentences we see nearby?

From the input

we get

The ship rose from the planet's farm surroundings.
The ship rose from the planet's great surface.
The ship rose from the planet's green corridor.
The ship rose from the rocks of a great mountain.
The ship rose from the planet's farm surface.

You can increase or decrease the distance you peer into sentence space from your initial location; as you increase it, the results get more diverse. Adjust this slider, then use the button again:

Closer to home … further afield

If you drag the slider fully to the left and look around, the results will all be identical, showing you the autoencoder’s best attempt at capturing your original sentence. Its reproduction is sometimes perfect; for example, try “The ship landed on the runway.” Don’t forget the period — it matters!

More often, the autoencoder returns something that seems … a bit … blurred? The effect gets stronger as your style and subject matter diverge from the autoencoder’s original dataset of sentences from science fiction. What you’re seeing is the transition from the richness of arbitrary text to the regularity of this particular sentence space. It’s very expressive — there are a lot of sentences in here to explore — but not infinitely so.
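
In these terms, a neighborhood is nothing fancier than the original sentence’s point plus a little random noise, and the slider above is the noise scale. Here is a sketch, again with hypothetical encode and decode stand-ins rather than the project’s actual functions:

    import numpy as np

    def sentence_neighborhood(encode, decode, sentence, radius=0.5, count=5, seed=None):
        """Sample sentences near `sentence` in latent space.
        `radius` plays the role of the slider: near zero you get the
        autoencoder's best reconstruction; larger values wander further afield."""
        rng = np.random.default_rng(seed)
        z = np.asarray(encode(sentence))
        neighbors = []
        for _ in range(count):
            noise = rng.standard_normal(z.shape) * radius
            neighbors.append(decode(z + noise))
        return neighbors

    # With a real model, something like:
    # sentence_neighborhood(model.encode, model.decode,
    #                       "The ship landed on the runway.", radius=0.3)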

Anyway.

After I’d gotten this up and running, I felt something similar to what I remembered from an earlier machine learning project: a sense of, well, I did it … now what?

That feeling is an important waystation. Sentence gradients are weird; maybe nothing more than linguistic baubles. But I believe there’s something undeniably deep and provocative about this space packed full of language. Drawing gradients and exploring neighborhoods are just two ways of moving through it. How else might you travel?

I’ve published the code, which is mostly the work of Semeniuta et al., with important improvements by Richard Assar and a few embroideries by me.

Maybe you can imagine something different to do inside this science fiction sentence space, or maybe you’d rather establish a space all your own, built on sentences of your choosing. You could implement new operations; maybe you want to add sentences together or find the average of many sentences. These spaces are dense with meaning and difficult to wrap your head around, and to me, that’s a very attractive combination.
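
Those operations are easy to write down once sentences are vectors; whether the decoded results mean anything is the interesting part. A sketch, with the same hypothetical encode and decode stand-ins as above:

    import numpy as np

    def sentence_average(encode, decode, sentences):
        """Decode the centroid of several sentences' latent vectors --
        one guess at what their "average" might sound like."""
        vectors = np.stack([np.asarray(encode(s)) for s in sentences])
        return decode(vectors.mean(axis=0))

    def sentence_sum(encode, decode, a, b):
        """Add two sentences' vectors and decode the result."""
        return decode(np.asarray(encode(a)) + np.asarray(encode(b)))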

As before, this is all about making tools (that make language) for humans to use — never automatic, always interactive. Playful, surprising, destabilizing. The other day, I generated a gradient, and a sentence appeared that I haven’t stopped thinking about since:

“The information buzzed, emptying his lips.”

What a sequence of words! I’d never have written that on my own, and now I want to use it somewhere — a treasure smuggled out of sentence space.

Go explore. Send back reports of your progress.

Or stay here on this page and play a little.


Thanks to Dan Bouk for his feedback on a draft of this post. Dan wrote a book about how not just sentences but whole lives got plotted and gridded, smoothed and statisticized. How Our Days Became Numbered is essential reading.


February 2018, Berkeley

March 2018, Berkeley