Voyages in sentence space

Well, it’s January 2020, and the day has finally come: I can no longer operate the software that made this essay interactive. I’ll leave the text up, but please understand that this essay is now considered “broken.” That’s the challenge this kind of presentation poses: it’s very cool when it works, but then — and this happens eventually, inevitably, entropically — it doesn’t.

Imagine a sentence. “I went looking for adventure.”

Imagine another one. “I never returned.”

Now imagine a sentence gradient between them — not a story, but a smooth interpolation of meaning. This is a weird thing to ask for! I’d never even bothered to imagine an interpolation between sentences before encountering the idea in a recent academic paper. But as soon as I did, I found it captivating, both for the thing itself — a sentence … gradient? — and for the larger artifact it suggested: a dense cloud of sentences, all related; a space you might navigate and explore.

Here’s what a neural network instructed to produce such a cloud of sentences (specifically, sentences from science fiction) delivers when you ask it to draw a gradient between “I went looking for adventure.” and “I never returned.”

I went looking for adventure.
I won't acknowledge.
I stared after midnight
I stared agitated.
I get definitely.
I never returned.

You can ask it to draw a gradient of your own! Just replace the first and last sentences and use this button:

So, does that sentence gradient make sense? I honestly don’t know. Is it useful? Probably not! But I do know it’s interesting, and the larger artifact — the dense cloud, the sentence space — feels very much like something worth exploring.

A comfortable embedding

I’ve been exploring neural networks — in particular, neural networks that generate text — for a while now. (You can find a previous experiment here.)

When you’re tinkering with these tools, trying to produce something interesting (maybe even artful) from a dataset, whether it’s composed of text or images or something else, you often find yourself embedding that data into numeric space.

At a super simple level, imagine a dataset consisting of color swatches: rusty orange, dusty magenta, deep purple. You can see why it might make sense to embed these standalone swatches into a one-dimensional number line, a smooth sweep of color — 

—so each has its own coordinate and there are also, as a significant added bonus, coordinates for all the intermediate colors between them.

Imagine a more complex dataset consisting of more colors. You can see how two dimensions might be useful:

Just like that, this dataset becomes something a neural network can chomp on, because it’s no longer color swatches described with metaphors, but a set of numbers. You know what computers love? SETS OF NUMBERS.
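
To make that concrete, here is a toy version in Python; the coordinates are invented by hand, and the point is only that each color becomes a position, so positions for the in-between colors exist too.

    # Hand-assigned toy embeddings: each swatch becomes a coordinate.
    colors_1d = {
        "rusty orange":  0.08,
        "dusty magenta": 0.55,
        "deep purple":   0.80,
    }

    # When one axis isn't enough, add another (say, hue and lightness).
    colors_2d = {
        "rusty orange":  (0.08, 0.45),
        "dusty magenta": (0.55, 0.60),
        "deep purple":   (0.80, 0.25),
    }

    # The added bonus: coordinates exist for the in-between colors, too.
    halfway = tuple(
        (a + b) / 2
        for a, b in zip(colors_2d["rusty orange"], colors_2d["deep purple"])
    )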

In practice, because datasets are often very rich — images of faces, sentences from science fiction stories — the numeric space into which you do this embedding will have not just two but dozens or hundreds of dimensions. I definitely can’t visualize a space like that — but so what? It turns out imaginary spaces are useful even if you can’t, in fact, imagine them.

Up above, embedding our color swatches into one or two dimensions was straightforward; the mapping was obvious. But how do we embed a face or a sentence into a numeric space with a hundred dimensions? How do we learn to map from “I went looking for adventure” to (-0.0036, -0.063, 0.014, … ) and back?

One tool we can use is called a variational autoencoder. It’s a kind of neural network that learns to embed rich data into numeric space, and not only embed it, but “pack” it densely. A variational autoencoder, even more than nature, abhors a vacuum.
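
If you’ve never met one, here is a minimal sketch of the general shape in PyTorch, just to make the idea concrete; it is not the architecture from the papers discussed below, and the layer sizes are arbitrary. The encoder squeezes an input down to a mean and a variance, a latent code is sampled from that distribution, the decoder reconstructs the input, and the loss rewards good reconstruction while a KL term keeps the codes densely packed around the origin.

    import torch
    import torch.nn as nn

    class TinyVAE(nn.Module):
        """A toy variational autoencoder: encode, sample, decode."""

        def __init__(self, input_dim=784, latent_dim=32):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
            self.to_mu = nn.Linear(256, latent_dim)      # mean of q(z|x)
            self.to_logvar = nn.Linear(256, latent_dim)  # log-variance of q(z|x)
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim, 256), nn.ReLU(),
                nn.Linear(256, input_dim), nn.Sigmoid(),
            )

        def forward(self, x):
            h = self.encoder(x)
            mu, logvar = self.to_mu(h), self.to_logvar(h)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
            return self.decoder(z), mu, logvar

    def vae_loss(x, x_hat, mu, logvar):
        # Reconstruction error plus a KL term that packs the latent
        # space densely around the origin (no vacuum allowed).
        recon = nn.functional.binary_cross_entropy(x_hat, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl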

In academic papers about autoencoders (like this one) you’ll often see a diagram demonstrating how a dataset of celebrity faces has been embedded into numeric space. The paper will show smooth (and perhaps slightly unsettling) gradients between points in that space, each of which represents a unique face:

Here’s where things get interesting. In 2016, a paper called “Generating Sentences from a Continuous Space,” published by Samuel R. Bowman, Luke Vilnis, et al., showed that you can use a variational autoencoder to embed sentences into numeric space, and pioneered a few techniques to make it possible.

The paper also introduced, along the same lines as the unsettling celebrity gradient, the concept of a smooth homotopy, or linear interpolation, between sentences. I understood these immediately as sentence gradients and as soon as I read the paper … I had to have them.
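
The concept itself translates into very little code, once you have a way into and out of the numeric space. In the sketch below, encode and decode are hypothetical stand-ins for the autoencoder’s two halves, not functions from either paper.

    import numpy as np

    def sentence_gradient(start, end, steps=5, encode=None, decode=None):
        """Walk a straight line through sentence space.

        `encode` maps a sentence to its latent vector; `decode` maps a
        latent vector back to a sentence. Both are hypothetical stand-ins
        for the autoencoder's two halves.
        """
        z_start, z_end = encode(start), encode(end)
        return [
            decode((1.0 - t) * z_start + t * z_end)
            for t in np.linspace(0.0, 1.0, steps)
        ]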

Programming is hard

I tried to implement the paper myself. I failed. Even after corresponding with the authors, I just couldn’t get the basic autoencoding engine to work.

Lucky me: not even a year later, another paper appeared, extending the work of Bowman et al. “A Hybrid Convolutional Variational Autoencoder for Text Generation” by Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth offered substantial additions to the idea and, even better, it offered THE CODE!!

(Let me just take a moment to praise researchers who publish their code. Without this project from Semeniuta et al. as a starting point, I would never have been able to explore these techniques. What a gift.)

Code in hand, I was well on my way to generating sentence gradients myself. I figured out the math to move through sentence space, implemented a few features to help organize experiments, added a simple server.

But there was a persistent problem: it ran too slowly. I would write two sentences, ask the neural network to generate a gradient between them, and … wait. And wait and wait. Minutes passed. The process was too drawn out for experimentation, for exploration, for play.

Again, I tried to fix it myself. But I didn’t (and still don’t!) understand the innermost engine enough to see how I could speed up that process of moving sentences in and out of numeric space.

That’s when I asked for help.

The programmer Richard Assar’s implementation of a paper called SampleRNN, shared on GitHub, had impressed me with its usability and its speed. Sound generated by his code made its way into the audiobook of my latest novel. So, I reached out to him, asking: could I commission you to take a look at this sentence space project?

Richard said yes, and overnight — not literally overnight, but … basically overnight — he made it go faster by an order of magnitude. Now, you only wait a beat. Sentence one, sentence two, beat … gradient.

Working with this code shared by Semeniuta et al., streamlined by Richard Assar, what did I end up with?

Welcome to sentence space

My project sentence-space, now public on GitHub, provides an API that serves up two things (both sketched roughly in code below):

  1. Sentence gradients: smooth interpolations between two input sentences.
  2. Sentence neighborhoods: clouds of alternative sentences closely related to an input sentence.
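
Here’s roughly what a client might look like; the port, routes, and parameter names are invented for illustration, so consult the repository for the real interface.

    import requests

    # Hypothetical routes and parameters, for illustration only;
    # the real interface lives in the sentence-space repository.
    BASE = "http://localhost:8080"

    gradient = requests.get(BASE + "/gradient", params={
        "start": "I went looking for adventure.",
        "end": "I never returned.",
        "steps": 5,
    }).json()

    neighborhood = requests.get(BASE + "/neighborhood", params={
        "sentence": "The ship landed on the runway.",
        "distance": 0.5,
    }).json()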

Sentence neighborhoods are simpler than gradients. Given an input sentence, what if we imagine ourselves standing at its location in sentence space, peering around, jotting down some of the other sentences we see nearby?

From the input

The ship rose from the planet's surface.

we get

The ship rose from the planet's farm surroundings.
The ship rose from the planet's great surface.
The ship rose from the planet's green corridor.
The ship rose from the rocks of a great mountain.
The ship rose from the planet's farm surface.
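
Roughly speaking, a neighborhood comes from nudging the sentence’s latent vector in a few random directions and decoding each nudge. A sketch, again with hypothetical encode and decode stand-ins rather than the project’s actual functions:

    import numpy as np

    def sentence_neighborhood(sentence, count=5, distance=0.5,
                              encode=None, decode=None):
        """Peer around a sentence's location in sentence space.

        Adds Gaussian noise (scaled by `distance`) to the latent vector
        and decodes each noisy copy; with distance=0, every result is just
        the autoencoder's reconstruction of the original sentence.
        """
        z = encode(sentence)
        return [
            decode(z + np.random.normal(scale=distance, size=z.shape))
            for _ in range(count)
        ]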

You can increase or decrease the distance you peer into sentence space from your initial location; as you increase it, the results get more diverse. Adjust this slider, then use the button again:

If you drag the slider fully to the left and look around, the results will all be identical, showing you the autoencoder’s best attempt at capturing your original sentence. Its reproduction is sometimes perfect; for example, try “The ship landed on the runway.” Don’t forget the period — it matters!

More often, the autoencoder returns something that seems … a bit … blurred? The effect gets stronger as your style and subject matter diverge from the autoencoder’s original dataset of sentences from science fiction. What you’re seeing is the transition from the richness of arbitrary text to the regularity of this particular sentence space. It’s very expressive — there are a lot of sentences in here to explore — but not infinitely so.

Anyway.

After I’d gotten this up and running, I felt something similar to what I remembered from an earlier machine learning project: a sense of, well, I did it … now what?

That feeling is an important waystation. Sentence gradients are weird; maybe nothing more than linguistic baubles. But I believe there’s something undeniably deep and provocative about this space packed full of language. Drawing gradients and exploring neighborhoods are just two ways of moving through it. How else might you travel?

I’ve published the code, which is mostly the work of Semeniuta et al., with important improvements by Richard Assar and a few embroideries by me.

Maybe you can imagine something different to do inside this science fiction sentence space, or maybe you’d rather establish a space all your own, built on sentences of your choosing. You could implement new operations; maybe you want to add sentences together or find the average of many sentences. These spaces are dense with meaning and difficult to wrap your head around, and to me, that’s a very attractive combination.
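
Averaging, for instance, might be as simple as averaging latent vectors and decoding the centroid. A sketch, with the same hypothetical encode and decode stand-ins as before:

    import numpy as np

    def sentence_average(sentences, encode=None, decode=None):
        """'Average' several sentences by averaging their latent vectors
        and decoding the centroid back into a single sentence."""
        zs = np.stack([encode(s) for s in sentences])
        return decode(zs.mean(axis=0))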

As before, this is all about making tools (that make language) for humans to use — never automatic, always interactive. Playful, surprising, destabilizing. The other day, I generated a gradient, and a sentence appeared that I haven’t stopped thinking about since:

“The information buzzed, emptying his lips.”

What a sequence of words! I’d never have written that on my own, and now I want to use it somewhere — a treasure smuggled out of sentence space.

Go explore. Send back reports of your progress.

Or stay here on this page and play a little.


Thanks to Dan Bouk for his feedback on a draft of this post. Dan wrote a book about how not just sentences but whole lives got plotted and gridded, smoothed and statisticized. How Our Days Became Numbered is essential reading.


February 2018, Berkeley

March 2018, Berkeley