This blog post is for approximately twelve people, but those twelve people are going to love it!
This year, there’s been a proliferation of wonderful, zine-y Colab notebooks that generate images from text prompts, all using OpenAI’s profound CLIP model alongside different kinds of image generators.
Reading these notebooks, I kept running into a chunk of code like this:
    cutouts = []
    for _ in range(self.cutn):
        # pick a random square size between min_size and max_size
        size = int(torch.rand([])**self.cut_pow * (max_size - min_size) + min_size)
        # pick a random position that keeps the square inside the image
        offsetx = torch.randint(0, sideX - size + 1, ())
        offsety = torch.randint(0, sideY - size + 1, ())
        cutout = input[:, :, offsety:offsety + size, offsetx:offsetx + size]
        # resize every cutout to the same cut_size x cut_size
        cutouts.append(F.adaptive_avg_pool2d(cutout, self.cut_size))
    return torch.cat(cutouts)
I read this as: “from the image under consideration, generate a bunch of views of smaller portions, their locations randomized.” But I had no idea why you’d want to do that.
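To check that reading, here is a minimal sketch of just the geometry, with Python's standard `random` module standing in for the torch calls. The helper name `sample_cutout_boxes` is made up, and the definitions of `max_size` and `min_size` are my assumptions, since the chunk above doesn't show where the notebooks set them.

```python
import random

def sample_cutout_boxes(side_x, side_y, cut_size, cutn, cut_pow=1.0, seed=0):
    """Return cutn random square boxes as (offset_x, offset_y, size) tuples.

    Hypothetical helper: it mirrors the sampling in the chunk above, with
    stdlib `random` standing in for torch.rand / torch.randint.
    """
    rng = random.Random(seed)
    # Assumed bounds: the chunk above does not show these definitions.
    max_size = min(side_x, side_y)
    min_size = min(side_x, side_y, cut_size)
    boxes = []
    for _ in range(cutn):
        # A uniform draw raised to cut_pow skews sizes toward small
        # (cut_pow > 1) or large (cut_pow < 1) cutouts.
        size = int(rng.random() ** cut_pow * (max_size - min_size) + min_size)
        # randint is inclusive on both ends, so the box always fits.
        offset_x = rng.randint(0, side_x - size)
        offset_y = rng.randint(0, side_y - size)
        boxes.append((offset_x, offset_y, size))
    return boxes

boxes = sample_cutout_boxes(side_x=640, side_y=480, cut_size=224, cutn=32)
for offset_x, offset_y, size in boxes[:3]:
    print(f"{size}x{size} square at ({offset_x}, {offset_y})")
```

Every box is a square that sits fully inside the image, and nothing stops two boxes from overlapping.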
The pixray project from @dribnet goes further; it doesn’t only offset the views, but warps their perspective and shifts their colors!
    augmentations.append(K.CenterCrop(size=self.cut_size, cropping_mode='resample', p=1.0, return_transform=True))
    augmentations.append(K.RandomPerspective(distortion_scale=0.20, p=0.7, return_transform=True))
    augmentations.append(K.ColorJitter(hue=0.1, saturation=0.1, p=0.8, return_transform=True))
The notebooks discuss the number of these “cutouts” as a key determinant of quality, as well as memory consumption. Well, I am always interested in both quality AND memory consumption, so I wanted to figure out what the cutouts were actually doing, and the code alone wasn’t forthcoming.
Characteristically, it was a tweet from the prolific Mario Klingemann, discovered on page three of a Google search, that provided the answer:
CLIP can only process patches of 224x224 so the typical way of evolving an image involves making a batch random crops at different scales so it can work on the whole image as well as details at the same time. [ … ]
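A small sketch of what that looks like in practice, assuming PyTorch and using a random tensor as a stand-in for the image being generated. The three crops are hand-picked here (the whole image, a mid-sized region, a 224-pixel detail) rather than random, and the helper name `crops_to_clip_batch` and its box format are made up for illustration.

```python
import torch
import torch.nn.functional as F

CLIP_SIDE = 224  # the patch size mentioned in the tweet above

def crops_to_clip_batch(image, boxes, side=CLIP_SIDE):
    """Cut square boxes out of image and pool each one to side x side.

    image: tensor shaped (1, channels, height, width)
    boxes: list of (offset_x, offset_y, size) tuples (hypothetical format)
    """
    crops = []
    for offset_x, offset_y, size in boxes:
        crop = image[:, :, offset_y:offset_y + size, offset_x:offset_x + size]
        # Whatever its original scale, every crop ends up side x side.
        crops.append(F.adaptive_avg_pool2d(crop, side))
    # Concatenate along the batch dimension so all views go through together.
    return torch.cat(crops)

image = torch.rand(1, 3, 512, 512)  # stand-in for the generated image
boxes = [(0, 0, 512), (128, 64, 320), (200, 250, 224)]
batch = crops_to_clip_batch(image, boxes)
print(tuple(batch.shape))  # (3, 3, 224, 224)
```

The first item in the batch is the whole image squeezed down to 224x224, while the last is a detail at full resolution.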
I also got an intimation of this from Ryan Murdock, who basically instigated all of these zines, in an interview on Derrick Schultz’s YouTube channel; he locates the technique’s discovery precisely in his experiments with these cutouts.
Here’s how I understand it now: the cutouts are different “ways of looking” at the image being generated. Maybe one cutout centers on the shadow fleeing down the corridor, while another looks closely at the pool of blood on the marble floor, and a third frames both details together. The potential for the cutouts to overlap and aggregate feels important to me; they don’t represent a grid-like decomposition, but rather a stream of glances. They feel very much like the way you’d absorb a painting at a museum, honestly.
(Here is your reminder that the eyes doing the glancing are CLIP’s, which then reports the numeric version of: “Eh, looks to me like somebody spilled some ketchup, but, if the fleeing shadow had a knife, then I might call it a murder … ” To which the image generator replies: “Got it! Adding a knife!”)
In most of the notebooks I’ve encountered, there are between 32 and 40 cutouts. That number is mostly determined by memory constraints, but I wonder if, even granted infinite VRAM, there’s a cutout sweet spot? Often, systems of this kind thrive on restrictions.
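For a rough sense of scale, here is the arithmetic for the one piece that is easy to count: the batch of cutouts handed to CLIP. It assumes 3-channel, 224x224, 32-bit float cutouts, and the function name is made up. It says nothing about the activations and gradients inside CLIP, which I'd expect to take far more memory and which also grow with the batch.

```python
def cutout_batch_bytes(cutn, side=224, channels=3, bytes_per_value=4):
    """Bytes needed to hold cutn cutouts as one dense float32 tensor.

    Assumed layout: (cutn, channels, side, side), 4 bytes per value.
    """
    return cutn * channels * side * side * bytes_per_value

for cutn in (2, 10, 20, 30, 32, 40):
    mib = cutout_batch_bytes(cutn) / 2**20
    print(f"{cutn:>2} cutouts: {mib:6.2f} MiB")
```

The count grows linearly: doubling the cutouts doubles the batch.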
I imagine 32 periscopes peeking up from the water, swiveling to find their targets, trying to make sense of the panorama of the world.
I believe these cutouts are newly randomized on each step of the generator’s march towards a satisfactory image, so it’s not only an aggregation of views across space, but also over time, as more and more “ways of looking” are evaluated.
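One way to picture that accumulation is a toy simulation, not anything taken from the notebooks: sample fresh random boxes on every step and keep track of how much of the image has been glanced at so far. The image size, size bounds, and cutout count are all made-up values, and sizes are drawn uniformly for simplicity.

```python
import random

def coverage_over_steps(side=512, min_size=224, cutn=4, steps=5, seed=0):
    """Fraction of pixels covered by at least one cutout after each step.

    Toy model: square image, uniform sizes, boxes re-sampled every step.
    """
    rng = random.Random(seed)
    seen = [[False] * side for _ in range(side)]
    fractions = []
    for _ in range(steps):
        for _ in range(cutn):
            size = rng.randint(min_size, side)
            offset_x = rng.randint(0, side - size)
            offset_y = rng.randint(0, side - size)
            # Mark every pixel inside this box as seen.
            for row in seen[offset_y:offset_y + size]:
                row[offset_x:offset_x + size] = [True] * size
        covered = sum(sum(row) for row in seen)
        fractions.append(covered / (side * side))
    return fractions

for step, fraction in enumerate(coverage_over_steps(), start=1):
    print(f"after step {step}: {fraction:.1%} of pixels seen at least once")
```

Coverage can only go up from step to step, which is the "over time" part. What the toy leaves out is that the same region also gets seen at many different scales.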
Here’s a quick example: four images, all generated using the same prompt, settings, and random seed. The only difference is the number of cutouts, which decreases from the upper-left, clockwise: 30 to 20 to 10 to 2.
This is just my interpretation, but, as the number of cutouts decreases, I think I see the images getting both fuzzier and more “general”; there is perhaps a sense of CLIP squinting at the whole thing.
I’m not sure any of n = 30, 20, or 10 is “better” or “worse,” though.
Okay, I hope you twelve people got something out of this!
October 2021, Berkeley