Expressive temperature

This is a very niche post documenting a technique I think might be useful to artists — especially musicians — who want to work with material sampled from machine learning systems.

There’s a primer below for people who are intrigued but not steeped in these tools. If you are, ahem, well-steeped, skip directly to the heading Turning the knob. Otherwise, proceed!

A primer on temperature

In machine learning systems designed to generate text or audio, there’s often a final step between the system’s model of the training data and some concrete output that a human can evaluate or enjoy. Technically, this step is sampling from a multinomial distribution, but you can think of it as rolling a giant die: many, many-sided, and also weighted. A weird die.

To that final roll, a factor is applied that is called, by convention, the “sampling temperature.” Its default value is 1.0, which means we roll the weighted die just as the model has provided it. But we can choose other values, too. We can tamper with the die.

It’s easiest to understand with examples.

Here’s some text sampled from a model trained on science fiction stories. First, let’s set the sampling temperature to 0.1. The weights get divided by the temperature, so a value less than 1.0 will make heavy weights much heavier — which is to say, more likely — giving us a drone of the training data’s most basic themes:

It was a strange thing. The strange statement was that the state of the planet was a strange state of special problems.

You could generate pages and pages of text at this temperature and you wouldn’t read about much besides planets and strangeness.

By contrast, if we set the temperature to 1.0, we get a nice, surprising sample:

It was here yesterday. There was a double fictional device with crocodile shoulders and phoney oceans and Katchanni.

Katchanni! Okay! You will never, ever find Katchanni at temperature 0.1.

We can crank the sampling temperature up even higher. Remember, we’re dividing, so values above 1.0 smooth the weights out, making our die “fairer,” so that weirder things become possible. At 1.5, the text begins to feel like a pot left unattended on the stove, hissing and rattling:

It was Winstead, my balked, old-fashioned 46 fodetes ratted.

And maybe that’s an effect we want sometimes! This is the point. I shall put it in bold. Often, when you’re using a machine learning system to generate material, sampling temperature is a key expressive control.
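For the curious, the dice-tampering amounts to a few lines of code: divide the log-probabilities by the temperature, then renormalize. This is a toy illustration, not SampleRNN’s actual code, and the three-outcome distribution is made up:

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Reweight a categorical distribution: divide log-probs by the
    temperature, then renormalize with a softmax."""
    logits = np.log(probs) / temperature
    scaled = np.exp(logits - logits.max())  # subtract max for numerical stability
    return scaled / scaled.sum()

# A toy three-sided weighted die.
probs = np.array([0.5, 0.3, 0.2])

cool = apply_temperature(probs, 0.1)  # heavy weights get much heavier
warm = apply_temperature(probs, 1.5)  # weights smooth out; a fairer die

print(cool.round(4))  # ~[0.9939, 0.006, 0.0001] -- a drone of the top outcome
print(warm.round(4))  # flatter than the original distribution

# Sampling is then just a weighted roll of the tampered die:
token = np.random.choice(len(probs), p=cool)
```

At 0.1, the most likely outcome swallows nearly all the probability mass; at 1.5, the underdogs get a real chance. That’s the whole trick.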

Sooo… let’s control it.

Turning the knob

If I repeated the exercise above but substituted samples of audio for text, you would recognize the same pattern: at low (“cool”?) temperatures, caution and repetition; around 1.0, a solid attempt to represent the diversity of the training data; beyond, Here Be Dragons.

It occurred to me, deep into some experiments with audio, that it should be possible to change the temperature not just between samples, but during them. In other words, to treat sampling temperature the way you might a filter knob on a synthesizer, sweeping it up or down to lend your sound movement and drama.
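In a sampling loop, that just means computing a temperature per step instead of fixing it once up front. A sketch, with a made-up toy model standing in for the real network (none of these names are SampleRNN’s actual API):

```python
import numpy as np

def sample_with_schedule(model_step, schedule, seed=0):
    """Generate one output per temperature in `schedule`, turning the
    knob as we go rather than fixing it before the roll."""
    rng = np.random.default_rng(seed)
    out = []
    for temperature in schedule:
        probs = model_step(out)                # next-step distribution
        logits = np.log(probs) / temperature   # tamper with the die...
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                   # ...and renormalize
        out.append(int(rng.choice(len(probs), p=probs)))
    return out

# A toy stand-in model: it ignores history and always offers the
# same three-way distribution.
toy_model = lambda history: np.array([0.5, 0.3, 0.2])

schedule = np.linspace(0.5, 1.5, 16)  # a cool-to-hot sweep over 16 steps
samples = sample_with_schedule(toy_model, schedule)
```

Early samples come from a conservative die; later ones from an increasingly fair (and weird) one.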

Let’s play around.

The toolkit:

  • Richard Assar's implementation of SampleRNN, which is also what I used to generate sound for the audiobook of my most recent novel. It's no longer state-of-the-art, but it produces results more interesting, to my ear, than anything currently available as open source.
  • A model trained on several different performances of a song I like (which shall remain unnamed). It's a relatively small dataset, so, in practice, this model acts as a grungy, circuitous kind of sampler, transforming and regurgitating the song in novel ways.

The audio snippets below are all straight out of SampleRNN, with no processing or editing, but they are unapologetically cherry-picked. They all have a sound that’s characteristic of this system: noisy, jangly, a bit wobbly. If you don’t like that sound, it’s likely you won’t find any of this particularly compelling, and… you should probably go listen to something else!

Finally — I feel like I always end up including some version of this caveat-slash-manifesto — I’m attracted to these techniques because they produce material with interesting (maybe unique?) characteristics that an author or artist can then edit, remix, and/or cast aside. Please consider the samples below in that context. Other people — researchers and tinkerers alike — are more motivated by the dream of a system that can write a whole song end-to-end. As they progress toward that goal… I will happily misappropriate their tools and bend them to my purposes.

Okay! To begin, here’s a sample generated The Normal Way, at constant temperature 1.0.

One-song model, constant temperature, 30 seconds

I think it sounds nice, but/and it has the characteristic “meander” of samples generated by these systems, text and audio alike. They lack long-range structure; they’re not “going” anywhere. It’s not a bad meander, and there are definitely ways to use this material creatively and productively.

But what if we want something different?

Here’s another sample — same model, same everything — generated by sweeping the temperature from 0.75 to 1.1 and back:

One-song model, temperature, 30 seconds

You can hear the difference, right? It’s not better or worse, just different, with more motion — a crescendo into complexity.
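For reference, that up-and-back sweep is easy to construct as a per-step temperature array. The step count here is arbitrary; in practice it would match the number of audio samples you plan to generate:

```python
import numpy as np

def triangle_sweep(lo, hi, steps):
    """A temperature curve that ramps from lo up to hi, then back down."""
    up = np.linspace(lo, hi, steps // 2)
    down = np.linspace(hi, lo, steps - steps // 2)
    return np.concatenate([up, down])

curve = triangle_sweep(0.75, 1.1, steps=1000)
# Feed curve[t] to the sampler at step t.
```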

Let’s go even further.

This next sample was generated using (1) that same temperature sweep, and also (2) a “guiding track” — a tiny looping snippet of audio copied from the training data. You’ll hear it. At low temperatures, the guiding track is used to “nudge” the model. (Guardrail? Straitjacket?) As the temperature increases, the guiding track’s influence fades until the model isn’t being nudged at all, and is free… to rock out.

One-song model, temperature, guiding track, 30 seconds
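I won’t claim this is exactly how the nudge works under the hood, but one plausible sketch: blend the model’s next-sample distribution with a distribution concentrated on the guiding track’s next sample, with a blend weight that fades to zero as the temperature rises. Every name below is an assumption for illustration, not SampleRNN’s API:

```python
import numpy as np

def nudge(model_probs, guide_index, temperature, t_lo=0.75, t_hi=1.1):
    """Blend the model's distribution toward the guiding track's next
    sample. At t_lo the guide dominates; by t_hi its influence is gone."""
    # A one-hot distribution on the guiding track's next sample.
    guide = np.zeros_like(model_probs)
    guide[guide_index] = 1.0
    # Guide strength fades linearly from 1 (cool) to 0 (hot).
    strength = np.clip((t_hi - temperature) / (t_hi - t_lo), 0.0, 1.0)
    mixed = strength * guide + (1.0 - strength) * model_probs
    return mixed / mixed.sum()

model_probs = np.array([0.2, 0.3, 0.5])
cool_mix = nudge(model_probs, guide_index=0, temperature=0.75)  # guide wins
hot_mix = nudge(model_probs, guide_index=0, temperature=1.1)    # model is free
```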

At this point, we aren’t using SampleRNN remotely the way it was intended; is this even machine learning anymore? If all we were getting out of this computational detour was a sine wave, it would be fair to say we were using the wrong tool for the job.

But… we’re getting something quite different!

Of course, we can turn the temperature knob however we want. Let’s try a more complex curve:

One-song model, temperature, 90 seconds

And here’s one more sample generated using (1) that same temperature curve, but (2) a different model, this one trained on a bundle of synth-y film music. Again, this is straight out of SampleRNN, so expect some burps and squawks:

Richer model, temperature, 90 seconds

I mean… a person could make something out of that!

None of this sampling works in real time, at least not outside of Google. You cannot yet turn a literal knob and listen while the sound changes — seeking, adjusting, conducting — but that “yet” has an expiration date, and I think we’re going to see a very new kind of synthesizer appear very soon now.

This post is also available in Russian.

August 2018, Oakland
