8 min read · Draft · modeling · calibration · script-intelligence

Calibration matters more than accuracy in flop prediction

A 90 percent accurate flop predictor that is over-confident is worse than an 80 percent accurate one that knows when it does not know.

By Priya Ramanathan, Head of Modeling

Most pitches for script-prediction tools lead with accuracy. Ours leads with calibration. The two are different, and the difference is the entire reason a studio committee should take a model seriously or ignore it.

Here is the distinction in one paragraph. Accuracy asks: how often is the model right? Calibration asks: when the model says it is 80 percent confident, is it right 80 percent of the time? A model can be highly accurate and badly calibrated. It can correctly call most flops as flops, and yet, when it expresses 95 percent confidence, only be right 70 percent of the time. If you bet money on its 95 percent calls, you lose. The accuracy number told you nothing about how to size that bet.
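A toy sketch makes the gap visible. The numbers below are synthetic and invented purely for illustration, not our data: a model whose scores get pushed toward the extremes still posts a respectable accuracy, but its most confident calls do not hit at the rate it claims.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic illustration: scores are inflated toward the extremes, so the
# model looks decisive and its accuracy looks fine, but its high-confidence
# calls land well below their stated confidence.
n = 50_000
true_p = rng.random(n)                       # each script's true flop risk
y = rng.random(n) < true_p                   # 1 = the film flopped
model_p = np.clip(0.5 + 1.8 * (true_p - 0.5), 0.01, 0.99)   # over-confident scores

accuracy = np.mean((model_p >= 0.5) == y)

high = model_p >= 0.90                       # the model's most confident flop calls
print(f"overall accuracy:                {accuracy:.2f}")
print(f"stated confidence on 90%+ calls: {model_p[high].mean():.2f}")
print(f"actual hit rate on 90%+ calls:   {y[high].mean():.2f}")
```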

In script prediction, this matters more than in almost any other domain. The decisions are large (a studio greenlight is a nine-figure commitment). The decisions are infrequent (a studio makes a few dozen of these a year). Most decisions are not made on the model's call alone; the model is one input among many. The relevant question for the executive is not "does the model agree with me?" The relevant question is "if I and the model disagree, how much should that move me?" That question can only be answered if the model is calibrated.

The pathological case for accuracy-first models in this domain is the model that learns to predict the prior. The base rate of theatrical flops, defined as films that fail to recover their production budget in their first release window, is around 60 percent. A model that always predicts flop, with high confidence, is 60 percent accurate. It is also useless. Calibration would expose this immediately: the model's 95-percent-confident flop calls would have a 60 percent hit rate, not 95.
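Here is that degenerate baseline in a few lines, again on synthetic labels drawn at the 60 percent base rate rather than real titles:

```python
import numpy as np

rng = np.random.default_rng(1)

# The degenerate baseline: always predict "flop" at 95% confidence.
# With a 60% flop base rate it scores 60% accuracy and is badly calibrated.
y = rng.random(10_000) < 0.60          # 1 = flopped, base rate ~60%
pred_p = np.full(y.shape, 0.95)        # always "flop", always 95% confident

accuracy = np.mean((pred_p >= 0.5) == y)       # ~0.60
hit_rate_at_95 = y[pred_p >= 0.95].mean()      # also ~0.60, not 0.95

print(f"accuracy: {accuracy:.2f}, hit rate on 95% calls: {hit_rate_at_95:.2f}")
```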

The way we test for calibration is unglamorous. We hold out a chronological tail of our training data, score every script the model has not seen, and bin the predictions by predicted probability. Then we compare the predicted probability in each bin to the actual hit rate. A perfectly calibrated model produces a diagonal line. Our most recent script-intelligence head is within two percentage points of the diagonal across all deciles, which is what we ship against. We re-run the calibration test weekly on incoming data; if a decile drifts more than three points, that is an incident and we recalibrate.
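A minimal sketch of that check, with variable names of my own choosing and the two- and three-point thresholds simply restated from the paragraph above:

```python
import numpy as np

def calibration_by_decile(y_true: np.ndarray, p_pred: np.ndarray):
    """Bin held-out predictions into deciles of predicted probability and
    compare each decile's mean prediction to its actual hit rate."""
    edges = np.linspace(0.0, 1.0, 11)                  # 10 equal-width bins
    bins = np.clip(np.digitize(p_pred, edges) - 1, 0, 9)
    rows = []
    for b in range(10):
        mask = bins == b
        if not mask.any():
            continue
        rows.append((b, p_pred[mask].mean(), y_true[mask].mean(), int(mask.sum())))
    return rows

def worst_gap(rows) -> float:
    """Largest gap between predicted probability and hit rate, in points."""
    return 100 * max(abs(pred - actual) for _, pred, actual, _ in rows)

# Ship gate: within 2 points of the diagonal on the chronological hold-out.
# Weekly re-run: a gap over 3 points in any decile is treated as an incident.
# rows = calibration_by_decile(y_holdout, p_holdout)
# assert worst_gap(rows) <= 2.0
```

If you would rather lean on a library, scikit-learn's calibration_curve performs the same uniform binning and returns the per-bin hit rates directly.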

This is also why every prediction response from SignalGrid carries a calibration_bucket field. It is the single most important field for using the prediction correctly. It tells you the historical hit rate of predictions at the model's current confidence level. If the model says 80 percent and the bucket says 78 percent, treat the score as the score. If the model says 80 percent and the bucket says 62 percent, the model is currently over-confident in that range, and you should weight the score accordingly.
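In code, the policy looks roughly like the sketch below. Only the calibration_bucket field is the real one described here; the surrounding response shape and the flop_probability name are placeholders for the example, and the three-point tolerance is one reasonable choice, not a product rule.

```python
def effective_score(prediction: dict) -> float:
    """Return the number to act on: the raw score when the model is well
    calibrated at this confidence level, the historical hit rate when it is not.

    `prediction` is an illustrative response shape; only `calibration_bucket`
    is the field described in the post.
    """
    score = prediction["flop_probability"]       # e.g. 0.80
    bucket = prediction["calibration_bucket"]    # historical hit rate at this confidence
    # Within tolerance: treat the score as the score.
    if abs(score - bucket) <= 0.03:
        return score
    # Over- or under-confident in this range: defer to the observed hit rate.
    return bucket

print(effective_score({"flop_probability": 0.80, "calibration_bucket": 0.78}))  # 0.80
print(effective_score({"flop_probability": 0.80, "calibration_bucket": 0.62}))  # 0.62
```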

We do not lead our marketing with our accuracy numbers, and we never will. Accuracy is the wrong number to optimize when the stakes are high and the decisions are rare. Calibration is the discipline of telling the user how much to trust a number, which is the only discipline that lets the user actually trust a number.