Robin Sloan
the lab
August 2018

Expressive temperature

This is a very niche post documenting a technique I think might be useful to artists — especially musicians — who want to work with material sampled from machine learning systems.

There’s a primer below for people who are intrigued but not steeped in these tools. If you are, ahem, well-steeped, skip directly to the heading Turning the knob. Otherwise, proceed!

A wild horn

A primer on temperature

In machine learning systems designed to generate text or audio, there’s often a final step between the system’s model of the training data and some concrete output that a human can evaluate or enjoy. Technically, this is the sampling of a multinomial distribution, but call it rolling a giant dice, many-many-sided, and also weighted. Weird dice.

To that final dice roll, there’s a factor applied that is called, by convention, the “sampling temperature.” Its default value is 1.0, which means we roll the weighted dice just as the model has provided it. But we can choose other values, too. We can tamper with the dice.

It’s easiest to understand with examples.

Here’s some text sampled from a model trained on science fiction stories. First, let’s set the sampling temperature to 0.1. The weights get divided by the temperature, so a value less than 1.0 will make heavy weights much heavier — which is to say, more likely — giving us a drone of the training data’s most basic themes:

It was a strange thing. The strange statement was that the state of the planet was a strange state of special problems.

You could sample pages and pages of text at this temperature and you wouldn’t read about much besides planets and strangeness.

By contrast, if we set the temperature to 1.0, we get a nice, surprising sample:

It was here yesterday. There was a double fictional device with crocodile shoulders and phoney oceans and Katchanni.

Katchanni! Okay! You will never, ever find Katchanni at temperature 0.1.

We can crank the sampling temperature up even higher. Remember, we’re dividing, so values above 1.0 mean the weights get smoothed out, making our dice “fairer”—weirder things become possible. At 1.5, the text begins to feel like a pot left unattended on the stove, hissing and rattling:

It was Winstead, my balked, old-fashioned 46 fodetes ratted.

And maybe that’s an effect we want sometimes! This is the point. I shall put it in bold. Often, when you’re using a machine learning system to generate material, sampling temperature is a key expressive control.
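For anyone who wants to see the dice tampered with directly, here is a minimal sketch in Python. It assumes the “weights” are the model’s raw scores (often called logits), which is the usual convention but isn’t stated above. The four scores are made up for illustration, and `apply_temperature` and `sample` are hypothetical names, not part of any library.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Divide the raw scores by the temperature, then normalize (softmax)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(logits, temperature, rng=random):
    """Roll the weighted dice once and return the index that came up."""
    probs = apply_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Assumption: invented scores for four possible next characters.
logits = [2.0, 1.0, 0.5, 0.1]
for t in (0.1, 1.0, 1.5):
    print(t, [round(p, 3) for p in apply_temperature(logits, t)])
```

At 0.1, nearly all of the probability piles onto the heaviest option; at 1.5, it spreads out across the others.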

Sooo … let’s control it.

Turning the knob

If I repeated the exercise above but substituted samples of audio for text, you would recognize the same pattern: at low (“cool”?) temperatures, caution and repetition; around 1.0, a solid attempt to represent the diversity of the training data; beyond, Here Be Dragons.

It occurred to me, deep into some experiments with audio, that it should be possible to change the temperature not just between samples, but during them. In other words, to treat sampling temperature the way you might a filter knob on a synthesizer, sweeping it up or down to lend your sound movement and drama.

Let’s play around.

The toolkit:

The audio snippets below are all straight out of SampleRNN, with no processing or editing, but they are unapologetically cherry-picked. They all have a sound that’s characteristic of this system: noisy, jangly, a bit wobbly. If you don’t like that sound, it’s likely you won’t find any of this particularly compelling, and … you should probably go listen to something else!

Finally — I feel like I always end up including some version of this caveat-slash-manifesto — I’m attracted to these techniques because they produce material with interesting (maybe unique?) characteristics that an author or artist can then edit, remix, and/or cast aside. Please consider the samples below in that context. Other people — researchers and tinkerers alike — are more motivated by the dream of a system that can write a whole song end-to-end. As they progress toward that goal … I will happily misappropriate their tools and bend them to my purposes 😎

Okay! To begin, here’s a sample generated The Normal Way, at constant temperature 1.0.

I think it sounds nice, but/and it has the characteristic “meander” of samples generated by these systems, text and audio alike. They lack long-range structure; they’re not “going” anywhere. It’s not a bad meander, and there are definitely ways to use this material creatively and productively.

But what if we want something different?

Here’s another sample — same model, same everything — generated by sweeping the temperature from 0.75 to 1.1 and back:

You can hear the difference, right? It’s not better or worse, just different, with more motion — a crescendo into complexity.
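A sweep like that is easy to sketch. The code below is not SampleRNN’s code: `toy_model` is a made-up stand-in for a trained network, and `sweep` and `generate` are hypothetical names. The only point it tries to make is that the temperature gets re-read at every step instead of being fixed for the whole sample.

```python
import math
import random

def sweep(step, total_steps, low=0.75, high=1.1):
    """Temperature at a given step: rises from low to high, then falls back."""
    position = step / max(total_steps - 1, 1)   # 0.0 .. 1.0 across the sample
    triangle = 1.0 - abs(2.0 * position - 1.0)  # 0 -> 1 -> 0
    return low + (high - low) * triangle

def toy_model(history, vocab_size=8):
    """Assumption: a fake 'network' that returns one score per possible
    next value and simply favors repeating the previous value."""
    last = history[-1] if history else 0
    return [2.0 if v == last else 0.0 for v in range(vocab_size)]

def generate(total_steps, schedule, rng):
    """Sample step by step, asking the schedule for a fresh temperature each time."""
    history = []
    for step in range(total_steps):
        temperature = schedule(step, total_steps)  # the knob, re-read every step
        scores = [s / temperature for s in toy_model(history)]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        choice = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        history.append(choice)
    return history

samples = generate(200, sweep, random.Random(0))
```

Any function of the step number can be passed in as `schedule`, which is what makes the knob feel like a knob.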

Let’s go even further.

This next sample was generated using (1) that same temperature sweep, and also (2) a “guiding track”—a tiny looping snippet of audio copied from the training data. You’ll hear it. At low temperatures, the guiding track is used to “nudge” the model. (Guardrail? Straitjacket?) As the temperature increases, the guiding track’s influence fades until the model isn’t being nudged at all, and is free … to rock out.
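How the nudge is wired in isn’t spelled out here, so what follows is only one plausible sketch, under loud assumptions: that the nudge is a bonus added to the score of whatever value the guiding track holds at the current step, and that the bonus fades linearly to zero as the temperature climbs from 0.75 to 1.1. The names `guide_strength` and `nudge`, the `max_bonus` value, and the guide values themselves are all invented for illustration.

```python
def guide_strength(temperature, low=0.75, high=1.1, max_bonus=4.0):
    """Assumed fade: full bonus at `low`, no bonus at or above `high`."""
    fade = (high - temperature) / (high - low)
    fade = min(1.0, max(0.0, fade))  # clamp to the range 0..1
    return max_bonus * fade

def nudge(scores, guide_value, temperature):
    """Return a copy of the scores with the guide's current value boosted."""
    nudged = list(scores)
    nudged[guide_value] += guide_strength(temperature)
    return nudged

# A tiny looping guide: its value at any step is guide[step % len(guide)].
guide = [3, 3, 5, 1]   # assumption: invented values
scores = [0.0] * 8     # assumption: a flat set of model scores
step = 6
print(nudge(scores, guide[step % len(guide)], temperature=0.75))
```

The nudged scores would then go through the usual divide-by-temperature step before the dice roll.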

At this point, we aren’t using SampleRNN remotely the way it was intended; is this even machine learning anymore? If all we were getting out of this computational detour was a sine wave, it would be fair to say we were using the wrong tool for the job.

But … we’re getting something quite different!

Of course, we can turn the temperature knob however we want. Let’s try a more complex curve:
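One way to describe such a curve is as a handful of keyframes with straight lines drawn between them. The keyframe values below are invented for illustration and are not the curve used for the sample; `curve` is a hypothetical helper.

```python
def curve(position, keyframes):
    """Linearly interpolate a temperature at `position` (0.0 to 1.0) from
    a list of (position, temperature) keyframes sorted by position."""
    if position <= keyframes[0][0]:
        return keyframes[0][1]
    for (x0, y0), (x1, y1) in zip(keyframes, keyframes[1:]):
        if position <= x1:
            return y0 + (y1 - y0) * (position - x0) / (x1 - x0)
    return keyframes[-1][1]  # past the last keyframe: hold its value

# Assumption: a made-up shape with two peaks and a dip between them.
keyframes = [(0.0, 0.8), (0.3, 1.1), (0.5, 0.7), (0.9, 1.2), (1.0, 0.8)]

# One temperature per sampling step, for a 100-step sample.
temperatures = [curve(i / 99, keyframes) for i in range(100)]
```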

And here’s one more sample generated using (1) that same temperature curve, but (2) a different model, this one trained on a bundle of synth-y film music. Again, this is straight out of SampleRNN, so expect some burps and squawks:

I mean … a person could make something out of that!

None of this sampling works in real time, at least not outside of Google. You cannot yet turn a literal knob and listen while the sound changes — seeking, adjusting, conducting — but that “yet” has an expiration date, and I think we’re going to see a very new kind of synthesizer appear very soon now.

This post is also available in Russian.

August 2018, Oakland