This is a post from Robin Sloan’s lab blog & notebook.

The cutouts

October 8, 2021

This blog post is for approximately twelve people, but those twelve people are going to love it!

This year, there’s been a proliferation of wonderful, zine-y Colab notebooks that generate images from text prompts, all using OpenAI’s profound CLIP model alongside different kinds of image generators.

Reading these notebooks, I kept running into a chunk of code like this:

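cutouts = []
for _ in range(self.cutn):
    # Pick a random square size between min_size and max_size (cut_pow skews the distribution).
    size = int(torch.rand([])**self.cut_pow * (max_size - min_size) + min_size)
    # Pick a random top-left corner such that the square stays inside the image.
    offsetx = torch.randint(0, sideX - size + 1, ())
    offsety = torch.randint(0, sideY - size + 1, ())
    # Slice the square out, then pool it to cut_size x cut_size.
    cutout = input[:, :, offsety:offsety + size, offsetx:offsetx + size]
    cutouts.append(F.adaptive_avg_pool2d(cutout, self.cut_size))
# Stack all the cutouts along the batch dimension.
return torch.cat(cutouts)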

I read this as: “from the image under consideration, generate a bunch of views of smaller portions, their locations randomized.” But I had no idea why you’d want to do that.
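
To make that reading concrete, here's a minimal sketch in plain Python (standard library only, no torch) that samples just the cutout boxes, using the same arithmetic as the chunk above. Everything named here is an assumption for illustration: the function `sample_cutout_boxes`, the 640x480 image, and the explicit `min_size` and `max_size` arguments, which the excerpt uses but doesn't define.

```python
import random


def sample_cutout_boxes(side_x, side_y, cutn, min_size, max_size, cut_pow=1.0, seed=None):
    """Return cutn square boxes (offset_x, offset_y, size) inside a side_x by side_y image.

    Hypothetical helper: it mirrors the arithmetic of the excerpt, nothing more.
    """
    rng = random.Random(seed)
    boxes = []
    for _ in range(cutn):
        # A random fraction in [0, 1), raised to cut_pow, scaled into the size range.
        # A cut_pow above 1 pushes the fraction toward 0, favoring smaller cutouts.
        size = int(rng.random() ** cut_pow * (max_size - min_size) + min_size)
        # randint is inclusive on both ends, so the square always fits in the image.
        offset_x = rng.randint(0, side_x - size)
        offset_y = rng.randint(0, side_y - size)
        boxes.append((offset_x, offset_y, size))
    return boxes


# Assumed example values: a 640x480 image and 32 cutouts.
boxes = sample_cutout_boxes(side_x=640, side_y=480, cutn=32,
                            min_size=224, max_size=480, seed=0)
for offset_x, offset_y, size in boxes[:4]:
    print(f"{size}px square at ({offset_x}, {offset_y})")
```

Raising `cut_pow` above 1 skews the sizes toward `min_size`, so more of the views are close-ups.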

The pixray project from @dribnet goes further; it doesn’t only offset the views, but warps their perspective and shifts their colors!

augmentations.append(K.CenterCrop(size=self.cut_size, cropping_mode='resample', p=1.0, return_transform=True))
augmentations.append(K.RandomPerspective(distortion_scale=0.20, p=0.7, return_transform=True))
augmentations.append(K.ColorJitter(hue=0.1, saturation=0.1, p=0.8, return_transform=True))

The notebooks discuss the number of these “cutouts” as a key determinant of quality, as well as memory consumption. Well, I am always interested in both quality AND memory consumption, so I wanted to figure out what the cutouts were actually doing, and the code alone wasn’t forthcoming.

Characteristically, it was a tweet from the prolific Mario Klingemann, discovered on page three of a Google search, that provided the answer:

CLIP can only process patches of 224x224 so the typical way of evolving an image involves making a batch of random crops at different scales so it can work on the whole image as well as details at the same time. [ … ]
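
The resizing step implied by the tweet can be sketched in plain Python. This is a stand-in for the `F.adaptive_avg_pool2d` call in the chunk near the top, not the notebooks' code; the function name `average_pool_to`, the tiny 8x8 "image", and the 2x2 target (standing in for CLIP's 224x224) are assumptions chosen to keep the numbers readable. The point is only that a view of the whole image and a view of one corner both end up the same size before CLIP sees them.

```python
def average_pool_to(grid, out_size):
    """Average-pool a square 2D list of numbers down to out_size x out_size."""
    in_size = len(grid)

    def window(i):
        # Output cell i averages the input indices in [start, end).
        start = (i * in_size) // out_size
        end = -((-(i + 1) * in_size) // out_size)  # ceiling division
        return start, end

    pooled = []
    for row in range(out_size):
        r0, r1 = window(row)
        pooled_row = []
        for col in range(out_size):
            c0, c1 = window(col)
            cells = [grid[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            pooled_row.append(sum(cells) / len(cells))
        pooled.append(pooled_row)
    return pooled


# An 8x8 toy image whose values rise from the top-left to the bottom-right.
image = [[row * 8 + col for col in range(8)] for row in range(8)]

whole_view = average_pool_to(image, 2)     # the entire image, squeezed to 2x2
corner = [row[:4] for row in image[:4]]    # a 4x4 crop of the top-left corner
detail_view = average_pool_to(corner, 2)   # the crop, also squeezed to 2x2

print(whole_view)
print(detail_view)
```

Both views come out as 2x2 grids, so they can sit side by side in one batch even though one covers four times the area of the other.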

I also got an intimation of this from Ryan Murdock, who basically instigated all of these zines, in an interview on Derrick Schultz’s YouTube channel; he locates the technique’s discovery precisely in his experiments with these cutouts.

Here’s how I understand it now: the cutouts are different “ways of looking” at the image being generated. Maybe one cutout centers on the shadow fleeing down the corridor, while another looks closely at the pool of blood on the marble floor, and a third frames both details together. The potential for the cutouts to overlap and aggregate feels important to me; they don’t represent a grid-like decomposition, but rather a stream of glances. They feel very much like the way you’d absorb a painting at a museum, honestly.

(Here is your reminder that the eyes doing the glancing are CLIP’s, which then reports the numeric version of: “Eh, looks to me like somebody spilled some ketchup, but, if the fleeing shadow had a knife, then I might call it a murder … ” To which the image generator replies: “Got it! Adding a knife!”)

In most of the notebooks I’ve encountered, there are between 32 and 40 cutouts. That number is mostly determined by memory constraints, but I wonder if, even granted infinite VRAM, there’s a cutout sweet spot? Often, systems of this kind thrive on restrictions.

I imagine 32 periscopes peeking up from the water, swiveling to find their targets, trying to make sense of the panorama of the world.

I believe these cutouts are newly randomized on each step of the generator’s march towards a satisfactory image, so it’s not only an aggregation of views across space, but also over time, as more and more “ways of looking” are evaluated.
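
If that belief is right, the accumulation over time can be pictured with a small simulation. This is a sketch under stated assumptions, not anything from the notebooks: a square 512-pixel image, 32 cutouts per step, a 224-pixel minimum size, and a coarse 4x4 grid of regions are all illustrative choices, and `glance_counts` is a hypothetical helper. It redraws the cutouts on every step and tallies how many of them touch each region.

```python
import random


def glance_counts(side, cutn, steps, min_size, grid=4, seed=0):
    """Tally how many cutouts touch each cell of a grid x grid map of a square image."""
    rng = random.Random(seed)
    cell = side // grid
    counts = [[0] * grid for _ in range(grid)]
    for _ in range(steps):
        for _ in range(cutn):
            # Fresh random square on every step (equivalent to cut_pow = 1).
            size = int(rng.random() * (side - min_size) + min_size)
            x = rng.randint(0, side - size)
            y = rng.randint(0, side - size)
            for gy in range(grid):
                for gx in range(grid):
                    # The cutout touches a cell if their extents overlap on both axes.
                    overlaps_x = x < (gx + 1) * cell and x + size > gx * cell
                    overlaps_y = y < (gy + 1) * cell and y + size > gy * cell
                    if overlaps_x and overlaps_y:
                        counts[gy][gx] += 1
    return counts


for steps in (1, 10, 100):
    counts = glance_counts(side=512, cutn=32, steps=steps, min_size=224)
    flat = [c for row in counts for c in row]
    print(f"{steps:>3} steps: least-seen region {min(flat)} glances, most-seen {max(flat)}")
```

Corner regions tend to be touched less often than central ones, but the totals everywhere keep growing with the step count.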

Here’s a quick example: four images, all generated using the same prompt, settings, and random seed. The only difference is the number of cutouts, which decreases from the upper-left, clockwise: 30 to 20 to 10 to 2.

A grid of four images; as the number of cutouts decreases, they get a bit blurrier, more "general".

This is just my interpretation, but, as the number of cutouts decreases, I think I see the images getting both fuzzier and more “general”; there is perhaps a sense of CLIP squinting at the whole thing (“sure, that’s dark queen-ish”) rather than attending to particular details.

I’m not sure any of n=30, 20, or 10 is “better” or “worse,” though, which is interesting and, if you ask me, heartening.

Okay, I hope you twelve people got something out of this!
