Robin Sloan
the lab
October 2021

The cutouts

This blog post is for approximately twelve people, but those twelve people are going to love it!

This year, there’s been a proliferation of wonderful, zine-y Colab notebooks that generate images from text prompts, all using OpenAI’s profound CLIP model alongside different kinds of image generators.

Reading these notebooks, I kept running into a chunk of code like this:

cutouts = []
for _ in range(self.cutn):
    # pick a random square size between min_size and max_size;
    # cut_pow biases the distribution (higher = smaller cutouts)
    size = int(torch.rand([])**self.cut_pow * (max_size - min_size) + min_size)
    # pick a random position for that square within the image
    offsetx = torch.randint(0, sideX - size + 1, ())
    offsety = torch.randint(0, sideY - size + 1, ())
    cutout = input[:, :, offsety:offsety + size, offsetx:offsetx + size]
    # resize the crop to CLIP's input resolution and collect it
    cutouts.append(F.adaptive_avg_pool2d(cutout, self.cut_size))
return torch.cat(cutouts)

I read this as: “from the image under consideration, generate a bunch of views of smaller portions, their locations randomized.” But I had no idea why you’d want to do that.

The pixray project from @dribnet goes further; it not only offsets the views but also warps their perspective and shifts their colors!

# center-crop to the cutout size, then randomly warp perspective
# and jitter hue/saturation (p = probability each one is applied)
augmentations.append(K.CenterCrop(size=self.cut_size, cropping_mode='resample', p=1.0, return_transform=True))
augmentations.append(K.RandomPerspective(distortion_scale=0.20, p=0.7, return_transform=True))
augmentations.append(K.ColorJitter(hue=0.1, saturation=0.1, p=0.8, return_transform=True))
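
For orientation: these are Kornia augmentation modules, which can be chained together and applied to a whole batch of cutouts at once, each cutout drawing its own random parameters. Here’s a minimal sketch of that application (the names below are mine, and I’ve dropped the return_transform plumbing for simplicity):

import torch
import torch.nn as nn
import kornia.augmentation as K

# each module draws fresh random parameters per image in the batch
augs = nn.Sequential(
    K.RandomPerspective(distortion_scale=0.20, p=0.7),
    K.ColorJitter(hue=0.1, saturation=0.1, p=0.8),
)

batch = torch.rand(32, 3, 224, 224)  # stand-in for a batch of 32 cutouts
warped = augs(batch)                 # every cutout gets its own warp and jitter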

The notebooks discuss the number of these “cutouts” as a key determinant of quality, as well as memory consumption. Well, I am always interested in both quality AND memory consumption, so I wanted to figure out what the cutouts were actually doing, and the code alone wasn’t forthcoming.

Characteristically, it was a tweet from the prolific Mario Klingemann, discovered on page three of a Google search, that provided the answer:

CLIP can only process patches of 224x224 so the typical way of evolving an image involves making a batch [of] random crops at different scales so it can work on the whole image as well as details at the same time. [ … ]
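
To make the earlier fragment self-contained, here’s the same idea as a standalone function; the variables the notebooks define elsewhere (sideX, sideY, min_size, max_size) are filled in the way I understand them, so treat this as a sketch rather than gospel:

import torch
import torch.nn.functional as F

def make_cutouts(image, cutn=32, cut_size=224, cut_pow=1.0):
    # image is a (1, 3, H, W) tensor; returns (cutn, 3, cut_size, cut_size)
    sideY, sideX = image.shape[2:4]
    max_size = min(sideX, sideY)
    min_size = min(sideX, sideY, cut_size)
    cutouts = []
    for _ in range(cutn):
        # a random square at a random position, as in the loop quoted above
        size = int(torch.rand([])**cut_pow * (max_size - min_size) + min_size)
        offsetx = torch.randint(0, sideX - size + 1, ())
        offsety = torch.randint(0, sideY - size + 1, ())
        cutout = image[:, :, offsety:offsety + size, offsetx:offsetx + size]
        cutouts.append(F.adaptive_avg_pool2d(cutout, cut_size))
    return torch.cat(cutouts)  # a batch of 224x224 views, sized for CLIP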

I also got an intimation of this from Ryan Murdock, who basically instigated all of these zines, in an interview on Derrick Schultz’s YouTube channel; he locates the technique’s discovery precisely in his experiments with these cutouts.

Here’s how I understand it now: the cutouts are different “ways of looking” at the image being generated. Maybe one cutout centers on the shadow fleeing down the corridor, while another looks closely at the pool of blood on the marble floor, and a third frames both details together. The potential for the cutouts to overlap and aggregate feels important to me; they don’t represent a grid-like decomposition, but rather a stream of glances. They feel very much like the way you’d absorb a painting at a museum, honestly.

(Here is your reminder that the eyes doing the glancing are CLIP’s, which then reports the numeric version of: “Eh, looks to me like somebody spilled some ketchup, but, if the fleeing shadow had a knife, then I might call it a murder … ” To which the image generator replies: “Got it! Adding a knife!”)

In most of the notebooks I’ve encountered, there are between 32 and 40 cutouts. That number is mostly determined by memory constraints, but I wonder if, even granted infinite VRAM, there’s a cutout sweet spot? Often, systems of this kind thrive on restrictions.
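
A back-of-envelope check, because I was curious: the cutout batch itself is small; the real cost is that every cutout’s activations must be stored for backprop through CLIP, and that scales linearly with the count.

cutn = 32
batch_mib = cutn * 3 * 224 * 224 * 4 / 2**20  # float32 bytes, in MiB
print(f"{batch_mib:.1f} MiB")                 # ~18.4 MiB for the batch alone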

I imagine 32 periscopes peeking up from the water, swiveling to find their targets, trying to make sense of the panorama of the world.

I believe these cutouts are newly randomized on each step of the generator’s march towards a satisfactory image, so it’s an aggregation of views not only across space but also over time, as more and more “ways of looking” are evaluated.
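
In code terms, I picture the loop going something like this. The simplest possible “generator” here is a raw pixel grid; the CLIP calls are the openai/CLIP package’s actual API, and make_cutouts is the sketch from above. The prompt, the step count, and the loss (one common choice among several) are illustrative:

import clip  # the openai/CLIP package
import torch
import torch.nn.functional as F
from torchvision import transforms

model, _ = clip.load("ViT-B/32")

# CLIP's input normalization constants
normalize = transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                                 std=(0.26862954, 0.26130258, 0.27577711))

with torch.no_grad():  # the prompt is embedded once, up front
    text_embed = model.encode_text(clip.tokenize(["a dark queen"]))

params = torch.randn(1, 3, 400, 400, requires_grad=True)
optimizer = torch.optim.Adam([params], lr=0.05)

for step in range(300):
    image = torch.sigmoid(params)   # squash raw pixels into [0, 1]
    batch = make_cutouts(image)     # fresh random cutouts, every step
    image_embeds = model.encode_image(normalize(batch))
    # average distance between each "glance" and the prompt
    loss = (1 - F.cosine_similarity(image_embeds, text_embed)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()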

Here’s a quick example: four images, all generated using the same prompt, settings, and random seed. The only difference is the number of cutouts, which decreases from the upper-left, clockwise: 30 to 20 to 10 to 2.

A grid of four images; as the number of cutouts decreases, they get a bit blurrier, more “general”.

This is just my interpretation, but, as the number of cutouts decreases, I think I see the images getting both fuzzier and more “general”; there is perhaps a sense of CLIP squinting at the whole thing — “sure, that’s dark queen-ish” — rather than attending to particular details.

I’m not sure any of n = 30, 20, or 10 is “better” or “worse,” though — which is interesting and, if you ask me, heartening.

Okay, I hope you twelve people got something out of this!

October 2021, Berkeley