Robin Sloan
the lab
July 2021

Ghost faves in the mystery machine

Hamlet’s friends restrain him while a ghost beckons; a monochrome etching, fine-lined, almost like a comic book illustration.

Hamlet, Horatio, Marcellus and the Ghost, 1796, Robert Thew

Recently, I was chatting with a friend about a problem they’d encountered after running my script to delete their old Twitter faves. (These are technically “likes” now, I know, but I will stubbornly call them faves, because it is the superior word.)

The problem was that, although the API reported a successful unfaving, many old faves were still attached to my friend’s profile. Here’s the strange part: when viewed, the tweets themselves showed hollow hearts, visibly unfaved … yet there they remained, a ghostly list, somehow both fave and not.

Even stranger — at this point it’s getting delicious — the ghost faves could be banished at last by refaving and then unfaving them; by power-cycling the little heart.

Cursory investigation indicates this is a widespread problem. A search for “twitter phantom likes” will reveal many people describing exactly the same behavior, with no evidence of a solution anywhere. Twitter even released a fresh new API endpoint for managing faves — and still, the ghosts are beyond its reach.

This is clearly a bug — the API is not doing what it’s supposed to do, or even what it “thinks it’s doing” — but I am not here to discuss a bug; rather, what the bug made me think about.

My understanding of large internet systems was transformed when I read about Facebook’s Mystery Machine. This was a tool documented back in 2014; I assume it is not in use anymore, but/and, to me, it’s still exemplary of the way these systems work, or don’t.

Facebook’s paper lays out the problem:

Current debugging and optimization methods scale poorly to deal with the complexity of modern Internet services, in which a single request triggers parallel execution of numerous heterogeneous software components over a distributed set of computers.

In other words: how do you analyze or debug a system made up of many different components, written by many different people in many different programming languages, running in many different places at many different times … that activate each other in complicated cascades? A system that grew organically, and very quickly? That was perhaps, heh, not perfectly documented along the way?

A “rational” answer might be: you write some code that captures an X-ray of what’s happening, and you put that code into all those components! I mean, this kind of software exists. But the reality of Facebook circa 2014, its messiness and scale, meant that wasn’t going to happen.

So what did they do instead?

Consequently, we develop a technique that generates a causal model of system behavior without the need to add substantial new instrumentation or manually generate a schema of application behavior. Instead, we generate the model via large-scale reasoning over individual software component logs.

In plain language: they watched the whole system as it ran, the parts of it they could see — the log messages — and inferred its operation from that activity, the way it played out in time. The Mystery Machine worked by literally hypothesizing “maybe X causes Y …” for a few million different Xs and Ys, testing each hypothesis against the logs until enough had been “disproven” that a clear-ish picture remained. It sounds to me more like botany or zoology than engineering or architecture!
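For flavor, here’s a toy sketch of that hypothesize-and-disprove loop — the event names and traces are all invented, nothing here is from Facebook’s actual implementation:

```python
from itertools import permutations

# Each "trace" maps an event name to its timestamp within one request.
# (Invented events; a real system would have millions of these.)
traces = [
    {"api_call": 0, "db_write": 5, "cache_update": 9, "render": 12},
    {"api_call": 0, "cache_update": 4, "db_write": 7, "render": 11},
    {"api_call": 0, "db_write": 3, "cache_update": 8, "render": 10},
]

# Start by hypothesizing "a always precedes b" for EVERY ordered pair.
events = sorted(traces[0])
hypotheses = {(a, b) for a, b in permutations(events, 2)}

# A single counterexample in any trace disproves a hypothesis.
for trace in traces:
    for a, b in list(hypotheses):
        if trace[a] >= trace[b]:
            hypotheses.discard((a, b))

# What survives is a rough causal model: orderings never violated.
for a, b in sorted(hypotheses):
    print(f"{a} happens before {b}")
```

In this tiny example, `db_write` and `cache_update` disprove each other’s orderings (they swap places between traces), while everything provably follows `api_call` and precedes `render` — which is the whole trick: the model falls out of elimination, not documentation.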

I find this totally cosmic: Facebook in 2014, by then already a VERY large and rich company — a powerful force in the world — did not understand, in a pretty deep way, how Facebook worked.

Maybe Facebook at the height of its “move fast and break things” era represents an extreme case. But the problem described in the Mystery Machine paper is not uncommon; the heterogeneous, asynchronous nature of large internet systems seems to produce it almost inevitably. An “application” like Facebook or Fortnite is just VERY different from an application like Photoshop or Street Fighter II.

So, at this point, I assume there are similar uncertainties at play in every large internet system, and I assume that the people who build and maintain those systems don’t totally understand them. In that way, I think they operate as much like custodians and caretakers as designers and engineers.

(It’s worth noting that this effect is even more pronounced when it comes to AI models, which are sort of naturally mysterious; for this reason, “AI explainability” is a knotty and important sub-field.)

Anyway, it feels to me like the ghost fave bug must have something to do with two or more of the many components in Twitter’s own mystery machine not talking to each other correctly; the API endpoint dutifully signals the database, but, elsewhere, a cache isn’t reset … something along those lines. But that’s just a guess. Honestly, I’d love to know if, inside Twitter HQ, the bug is totally understood, just not a priority … or if there’s some mystery to it. I’m rooting for mystery.
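To make the guess concrete, here’s a toy model — every name invented, Twitter’s real architecture entirely unknown to me — in which cache invalidation is gated on database state, so once the two stores disagree, only re-syncing them can clear the ghost:

```python
# Source of truth vs. a derived, stale cache (names hypothetical).
db_faves = set()                      # the database: fave already gone
cached_profile_faves = {"tweet_123"}  # the cache: the ghost lingers

def fave(tweet_id):
    """Faving writes through to both stores."""
    db_faves.add(tweet_id)
    cached_profile_faves.add(tweet_id)

def unfave(tweet_id):
    """Unfaving only invalidates the cache when the DB still has the fave."""
    if tweet_id in db_faves:
        db_faves.discard(tweet_id)
        cached_profile_faves.discard(tweet_id)
    # (Hypothetical bug: when the DB has already lost the fave, the
    # cache invalidation is skipped entirely, and the ghost survives.)

unfave("tweet_123")                         # the API reports success ...
print("tweet_123" in cached_profile_faves)  # True — still a ghost

fave("tweet_123")                           # power-cycle the little heart:
unfave("tweet_123")                         # now both stores agree
print("tweet_123" in cached_profile_faves)  # False — ghost banished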

July 2021, Oakland