Ghost faves in the mystery machine

Hamlet's friends restrain him while a ghost beckons; a monochrome etching, fine-lined, almost like a comic book illustration.

Hamlet, Horatio, Marcellus and the Ghost, 1796, Robert Thew

Recently, I was chatting with a friend about a problem they’d encountered after running my script to delete their old Twitter faves. (These are technically “likes” now, I know, but I will stubbornly call them faves, because it is the superior word.)

The problem was that, although the API reported a successful unfaving, many old faves were still attached to my friend’s profile. Here’s the strange part: when viewed, the tweets themselves showed hollow hearts, visibly unfaved… yet there they remained, a ghostly list, somehow both fave and not.

Even stranger—at this point it’s getting delicious—the ghost faves could be banished at last by refaving and then unfaving them; by power-cycling the little heart.

Cursory investigation indicates this is a widespread problem. A search for “twitter phantom likes” will reveal many people describing the same behavior exactly, with no evidence of a solution anywhere. Twitter even released a fresh new API endpoint for managing faves—and still, the ghosts are beyond its reach.

This is clearly a bug—the API is not doing what it’s supposed to do, or even, it seems, what it “thinks it’s doing”—but I am not here to discuss a bug; rather, what the bug made me think about.

My understanding of large internet systems was transformed when I read about Facebook’s Mystery Machine. This was a tool documented back in 2014; I assume it is not in use anymore, but/and, to me, it’s still exemplary of the way these systems work, or don’t.

Facebook’s paper lays out the problem:

Current debugging and optimization methods scale poorly to deal with the complexity of modern Internet services, in which a single request triggers parallel execution of numerous heterogeneous software components over a distributed set of computers.

In other words: how do you analyze or debug a system made up of many different components, written by many different people in many different programming languages, running in many different places at many different times… that activate each other in complicated cascades? A system that grew organically, and very quickly? That was uhhh perhaps not perfectly documented along the way?

A “rational” answer might be: you write some code that can capture an X-ray of what’s happening, and you put that code into all those components! I mean, this kind of software exists. But the reality of Facebook circa 2014, its messiness and scale, meant that simply wasn’t going to happen.

So what did they do instead?

Consequently, we develop a technique that generates a causal model of system behavior without the need to add substantial new instrumentation or manually generate a schema of application behavior. Instead, we generate the model via large-scale reasoning over individual software component logs.

In plain language: they watched the whole system as it ran, the parts of it they could see—the log messages—and inferred its operation from that activity, the way it played out in time. The Mystery Machine worked by literally hypothesizing “maybe X causes Y…” for a few million different Xs and Ys, testing each hypothesis against the logs until enough had been “disproven” that a clear-ish picture remained. It sounds to me more like botany or zoology than engineering or architecture! You notice that fruit sets in the flowers visited by bees; “hmm, maybe they had something to do with it…”
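To make that hypothesize-and-disprove loop concrete, here is a toy sketch in Python. It is not Facebook's code; the log format (one dict of event name to timestamp per request), the event names, and the function `infer_happens_before` are all my own assumptions for illustration. The idea is only this: start by supposing that every event might have to precede every other event, then throw out each supposition the moment a single trace contradicts it.

```python
from itertools import permutations


def infer_happens_before(traces):
    """Return the (a, b) pairs for which no trace shows b at or before a.

    Each trace is a hypothetical log summary: {event_name: timestamp}.
    """
    events = {name for trace in traces for name in trace}
    # Begin with every ordered pair as a live hypothesis: "a precedes b".
    hypotheses = set(permutations(sorted(events), 2))
    for trace in traces:
        for a, b in list(hypotheses):
            # One trace containing both events, with b not strictly after a,
            # is enough to disprove the hypothesis.
            if a in trace and b in trace and trace[a] >= trace[b]:
                hypotheses.discard((a, b))
    return hypotheses


# Three made-up request traces (timestamps in milliseconds).
traces = [
    {"request": 0, "db_read": 5, "cache_read": 6, "render": 12},
    {"request": 0, "cache_read": 3, "db_read": 7, "render": 10},
    {"request": 0, "db_read": 4, "render": 9},
]
surviving = infer_happens_before(traces)
```

In this toy data, "db_read precedes cache_read" and its reverse are both disproven, because the two events show up in either order; what survives is the unsurprising picture that the request comes first and the render comes last. Surviving a few traces is not proof of causation, which is why, as I understand it, this kind of approach leans on an enormous volume of logs.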

I find this totally cosmic: Facebook in 2014, by then already a VERY large and successful company—a really powerful force in the world—did not understand, in a pretty deep way, how Facebook worked.

Maybe Facebook at the height of its “move fast and break things” era represents an extreme case. But the problem described in the Mystery Machine paper is not uncommon; the heterogeneous, asynchronous nature of large internet systems seems to produce it almost inevitably. An “application” like Facebook or Fortnite is just VERY different from an application like Photoshop or Street Fighter II.

So, at this point, I assume there are similar uncertainties at play in every large internet system, and I assume that the people who build and maintain those systems don’t totally understand them. In that way—I’ve said this for years—I think they operate as much like custodians and caretakers as designers and engineers.

(It’s worth noting that this effect is even more pronounced when it comes to AI models, which are sort of naturally mysterious; for this reason, “AI explainability” is a knotty and important sub-field.)

Anyway, it feels to me like the ghost fave bug must have something to do with two or more of the many components in Twitter’s own mystery machine not “talking to each other” correctly; the API endpoint dutifully signals the database, but, elsewhere, a cache isn’t reset… something along those lines. But that’s just a guess. Honestly, I’d love to know if, inside Twitter HQ, the bug is totally understood, just not a priority… or if there’s some mystery to it. I’m rooting for mystery.
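Here is that guess as a toy model in Python. To be loud about it: this is entirely hypothetical. I have no knowledge of how Twitter stores faves; the class `ToyFaves`, its two unfave paths, and the separate profile list are invented purely to show how a ghost could arise and why power-cycling the heart might banish it.

```python
class ToyFaves:
    """A made-up two-component system: a store and a separately kept list."""

    def __init__(self):
        self.store = set()        # what the heart on a tweet reflects
        self.profile_list = []    # the cached list shown on a profile

    def fave(self, tweet_id):
        self.store.add(tweet_id)
        if tweet_id not in self.profile_list:
            self.profile_list.append(tweet_id)

    def unfave(self, tweet_id):
        # Healthy path: both components hear about the change.
        self.store.discard(tweet_id)
        if tweet_id in self.profile_list:
            self.profile_list.remove(tweet_id)

    def unfave_stale(self, tweet_id):
        # Hypothetical buggy path: the store is updated, the list is not.
        self.store.discard(tweet_id)

    def is_ghost(self, tweet_id):
        # Listed on the profile, yet the heart is hollow.
        return tweet_id in self.profile_list and tweet_id not in self.store


faves = ToyFaves()
faves.fave("tweet-1")
faves.unfave_stale("tweet-1")
ghost_before = faves.is_ghost("tweet-1")   # both fave and not

# Power-cycle the little heart through the healthy path.
faves.fave("tweet-1")
faves.unfave("tweet-1")
ghost_after = faves.is_ghost("tweet-1")
```

In the toy, the ghost exists only because two components disagree, and it vanishes once a path that touches both of them runs. Whether anything like this is happening inside Twitter, I can't say.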

July 2021, Oakland
