Robin Sloan
the lab
August 2018

Expressive temperature

This is a very niche post doc­u­menting a tech­nique I think might be useful to artists — espe­cially musicians — who want to work with mate­rial sam­pled from machine learning sys­tems.

There’s a primer below for people who are intrigued but not steeped in these tools. If you are, ahem, well-steeped, skip directly to the heading Turning the knob. Otherwise, proceed!

A wild horn

A primer on temperature

In machine learning sys­tems designed to gen­erate text or audio, there’s often a final step between the system’s model of the training data and some con­crete output that a human can eval­uate or enjoy. Technically, this is the sam­pling of a multi­n­o­mial distribution, but call it rolling a giant dice, many-many-sided, and also weighted. Weird dice.

To that final dice roll, there’s a factor applied that is called, by convention, the “sam­pling tem­per­a­ture.” Its default value is 1.0, which means we roll the weighted dice just as the model has pro­vided it. But we can choose other values, too. We can tamper with the dice.

It’s eas­iest to under­stand with examples.

Here’s some text sampled from a model trained on science fiction stories. First, let’s set the sampling temperature to 0.1. The weights get divided by the temperature (strictly speaking, it’s the logarithms of the weights that get divided), so a value less than 1.0 will make heavy weights much heavier, which is to say more likely, giving us a drone of the training data’s most basic themes:

It was a strange thing. The strange state­ment was that the state of the planet was a strange state of spe­cial problems.

You could sample pages and pages of text at this temperature and you wouldn’t read about much besides planets and strangeness.

By contrast, if we set the tem­per­a­ture to 1.0, we get a nice, sur­prising sample:

It was here yesterday. There was a double fic­tional device with croc­o­dile shoul­ders and phoney oceans and Katchanni.

Katchanni! Okay! You will never, ever find Katchanni at tem­per­a­ture 0.1.
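For the code-minded, here is a minimal sketch of what tampering with the dice might look like. It is not taken from any particular system: the four-word vocabulary and its weights are made up for illustration, and the function names are my own. It assumes the common convention of dividing log-weights by the temperature, which amounts to raising each weight to the power 1/temperature.

```python
import math
import random

def apply_temperature(weights, temperature):
    """Return probabilities after temperature scaling.

    Log-weights are divided by the temperature, which is the same as
    raising each weight to the power 1/temperature. Weights must be > 0.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    logits = [math.log(w) / temperature for w in weights]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(weights, temperature, rng=random):
    """Roll the reweighted dice once and return the chosen index."""
    probs = apply_temperature(weights, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical toy vocabulary and model weights, for illustration only
vocab = ["strange", "planet", "crocodile", "Katchanni"]
weights = [0.60, 0.30, 0.09, 0.01]

for t in (0.1, 1.0, 1.5):
    probs = apply_temperature(weights, t)
    print(t, [round(p, 4) for p in probs])
```

With these made-up numbers, the rarest word all but vanishes at 0.1, keeps its original 1% chance at 1.0, and becomes more likely than that above 1.0.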

We can crank the sam­pling tem­per­a­ture up even higher. Remember, we’re dividing, so values above 1.0 mean the weights get smoothed out, making our dice “fairer”—weirder things become pos­sible. At 1.5, the text begins to feel like a pot left unat­tended on the stove, hissing and rattling:

It was Winstead, my balked, old-fashioned 46 fodetes ratted.

And maybe that’s an effect we want sometimes! This is the point. I shall put it in bold. Often, when you’re using a machine learning system to gen­erate mate­rial, sam­pling tem­per­a­ture is a key expres­sive con­trol.

Sooo … let’s con­trol it.

Turning the knob

If I repeated the exer­cise above but sub­sti­tuted sam­ples of audio for text, you would rec­og­nize the same pattern: at low (“cool”?) tem­per­a­tures, cau­tion and repetition; around 1.0, a solid attempt to rep­re­sent the diver­sity of the training data; beyond, Here Be Dragons.

It occurred to me, deep into some exper­i­ments with audio, that it should be pos­sible to change the tem­per­a­ture not just between sam­ples, but during them. In other words, to treat sam­pling tem­per­a­ture the way you might a filter knob on a syn­the­sizer, sweeping it up or down to lend your sound move­ment and drama.
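In code, the idea might look something like the sketch below. The only real change from ordinary sampling is that the loop asks for a fresh temperature on every step instead of fixing one up front. Everything here is a stand-in: `toy_model` is a hypothetical placeholder for a trained network, and `generate` and `temperature_at` are names invented for this example, not part of SampleRNN.

```python
import random

def sample_index(weights, temperature, rng):
    """Sample one index after raising each weight to the power 1/temperature."""
    scaled = [w ** (1.0 / temperature) for w in weights]
    return rng.choices(range(len(scaled)), weights=scaled, k=1)[0]

def toy_model(history):
    """Hypothetical stand-in for a trained model: returns next-step weights.

    A real system would compute these from the samples generated so far;
    this toy simply favors repeating the most recent value.
    """
    weights = [1.0] * 8
    if history:
        weights[history[-1]] = 8.0
    return weights

def generate(model, n_steps, temperature_at, seed=0):
    """Generate n_steps values, asking temperature_at(step) before each roll."""
    rng = random.Random(seed)
    history = []
    for step in range(n_steps):
        weights = model(history)
        history.append(sample_index(weights, temperature_at(step), rng))
    return history

# Constant temperature, versus a slow linear rise from 0.5 to 1.5
flat = generate(toy_model, 200, lambda step: 1.0)
rising = generate(toy_model, 200, lambda step: 0.5 + step / 199)
```

Passing the temperature as a function of the step keeps the knob separate from the model, so any curve can be swapped in without touching the sampling loop.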

Let’s play around.

The toolkit:

The audio snip­pets below are all straight out of Sam­pleRNN, with no pro­cessing or editing, but they are unapolo­get­i­cally cherry-picked. They all have a sound that’s char­ac­ter­istic of this system: noisy, jangly, a bit wobbly. If you don’t like that sound, it’s likely you won’t find any of this par­tic­u­larly compelling, and … you should prob­ably go listen to some­thing else!

Finally — I feel like I always end up including some ver­sion of this caveat-slash-manifesto — I’m attracted to these tech­niques because they pro­duce mate­rial with inter­esting (maybe unique?) char­ac­ter­istics that an author or artist can then edit, remix, and/or cast aside. Please con­sider the sam­ples below in that context. Other people — researchers and tin­kerers alike — are more moti­vated by the dream of a system that can write a whole song end-to-end. As they progress toward that goal … I will hap­pily mis­ap­pro­priate their tools and bend them to my pur­poses 😎

Okay! To begin, here’s a sample gen­erated The Normal Way, at con­stant tem­per­a­ture 1.0.

I think it sounds nice, but/and it has the char­ac­ter­istic “meander” of sam­ples gen­erated by these sys­tems, text and audio alike. They lack long-range structure; they’re not “going” anywhere. It’s not a bad meander, and there are def­i­nitely ways to use this mate­rial cre­atively and productively.

But what if we want some­thing dif­ferent?

Here’s another sample — same model, same everything — gen­erated by sweeping the tem­per­a­ture from 0.75 to 1.1 and back:

You can hear the difference, right? It’s not better or worse, just dif­ferent, with more motion — a crescendo into com­plexity.
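A sweep like that can be described with a few lines of arithmetic. This is a sketch under one assumption: that the temperature rises and falls linearly. Only the endpoints, 0.75 and 1.1, come from the sample above; the function name and the triangular shape are illustrative.

```python
def sweep_up_and_back(step, n_steps, low=0.75, high=1.1):
    """Temperature at `step`: rises linearly from low to high, then returns."""
    if n_steps < 2:
        return low
    position = step / (n_steps - 1)             # 0.0 at the start, 1.0 at the end
    triangle = 1.0 - abs(2.0 * position - 1.0)  # climbs 0 -> 1, then falls 1 -> 0
    return low + (high - low) * triangle

# One temperature per generated step
schedule = [sweep_up_and_back(i, 101) for i in range(101)]
```

The schedule starts at 0.75, peaks at 1.1 at the midpoint, and lands back at 0.75 on the final step.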

Let’s go even further.

This next sample was generated using (1) that same temperature sweep, and also (2) a “guiding track”: a tiny looping snippet of audio copied from the training data. You’ll hear it. At low temperatures, the guiding track is used to “nudge” the model. (Guardrail? Straitjacket?) As the temperature increases, the guiding track’s influence fades until the model isn’t being nudged at all, and is free … to rock out.
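The post doesn't spell out how the nudge is wired in, so the following is only a guess at one way it could work, not the method actually used. It assumes the nudge is a blend between the model's probabilities and a one-hot "vote" for whatever value the looping guide holds at the current step, with the blend strength falling off linearly as the temperature rises. Every name and number here (`guide_strength`, `nudge`, `guide_at`, the 0.8 ceiling) is hypothetical.

```python
def guide_strength(temperature, low=0.75, high=1.1, ceiling=0.8):
    """Hypothetical fade: `ceiling` at or below `low`, 0.0 at or above `high`."""
    if temperature <= low:
        return ceiling
    if temperature >= high:
        return 0.0
    return ceiling * (high - temperature) / (high - low)

def nudge(probs, guide_value, strength):
    """Blend model probabilities with a one-hot distribution on guide_value."""
    return [
        (1.0 - strength) * p + (strength if i == guide_value else 0.0)
        for i, p in enumerate(probs)
    ]

def guide_at(guide_loop, step):
    """Read the looping guide snippet at `step`, wrapping around at the end."""
    return guide_loop[step % len(guide_loop)]

# Toy example: four possible output values, a three-step guide loop
model_probs = [0.25, 0.25, 0.25, 0.25]
guide_loop = [2, 0, 3]
cool = nudge(model_probs, guide_at(guide_loop, 0), guide_strength(0.75))
hot = nudge(model_probs, guide_at(guide_loop, 0), guide_strength(1.1))
```

In this toy setup, the cool distribution leans hard toward the guide's value, while the hot one is just the model's own distribution, untouched.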

At this point, we aren’t using Sam­pleRNN remotely the way it was intended; is this even machine learning anymore? If all we were get­ting out of this com­pu­ta­tional detour was a sine wave, it would be fair to say we were using the wrong tool for the job.

But … we’re get­ting some­thing quite dif­ferent!

Of course, we can turn the tem­per­a­ture knob how­ever we want. Let’s try a more com­plex curve:
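The exact curve isn't given here, but one plausible way to describe an arbitrary one is as a handful of breakpoints with straight lines drawn between them. The sketch below does that. The breakpoint values are invented for illustration and are not the curve behind the sample.

```python
from bisect import bisect_right

def curve(breakpoints):
    """Build temperature_at(position) from (position, temperature) pairs.

    Positions run from 0.0 (start of the sample) to 1.0 (end) and must be
    sorted in ascending order. Values between breakpoints are interpolated
    linearly; values outside the range are clamped to the nearest endpoint.
    """
    xs = [x for x, _ in breakpoints]
    ys = [y for _, y in breakpoints]

    def temperature_at(position):
        if position <= xs[0]:
            return ys[0]
        if position >= xs[-1]:
            return ys[-1]
        i = bisect_right(xs, position)  # xs[i - 1] <= position < xs[i]
        x0, x1 = xs[i - 1], xs[i]
        y0, y1 = ys[i - 1], ys[i]
        return y0 + (y1 - y0) * (position - x0) / (x1 - x0)

    return temperature_at

# Hypothetical curve: rise, dip, rise higher, settle back down
temperature_at = curve([(0.0, 0.75), (0.3, 1.1), (0.5, 0.8), (0.8, 1.2), (1.0, 0.75)])
n_steps = 1000
schedule = [temperature_at(i / (n_steps - 1)) for i in range(n_steps)]
```

Because the curve is defined on positions between 0.0 and 1.0 rather than step counts, the same shape can be stretched over a sample of any length.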

And here’s one more sample gen­erated using (1) that same tem­per­a­ture curve, but (2) a dif­ferent model, this one trained on a bundle of synth-y film music. Again, this is straight out of Sam­pleRNN, so expect some burps and squawks:

I mean … a person could make some­thing out of that!

None of this sam­pling works real-time, at least not out­side of Google. You cannot yet turn a lit­eral knob and listen while the sound changes — seeking, adjusting, conducting — but that “yet” has an expi­ra­tion date, and I think we’re going to see a very new kind of syn­the­sizer appear very soon now.

This post is also avail­able in Russian.

August 2018, Oak­land