Football, Covid-19, and distributed systems hazards

Looking at the latest trickle of Covid-19 cases in the NFL – specifically in the Patriots locker room – it strikes me that some of the challenges of public health safety closely resemble the issues of distributed system safety in computer systems, and each can help highlight important lessons in the other.

Caveats: I am not an epidemiologist, nor do I play one on TV. There is still a lot we don’t have certainty on around Covid-19, from incubation periods and transmission mechanics, to testing reliability and safety protocols. I am also not an NFL insider, and most of my information about what the NFL is doing is inferred from the fantastic coverage of a number of NFL reporters, especially on the Patriots and medical beats. But we can infer and assume some things for the sake of thinking about the safety of the NFL’s apparent distributed safety protocol.

Background: There are two interesting classes of tests for Covid-19: qPCR and POC. qPCR (quantitative polymerase chain reaction) is the more reliable test, but takes a number of hours to get a result; there appears to be a 10-12 hour lag from point of test to results being made available, based on available sensors (we see players get tested before practices, and we hear about results that evening or night). POC (point of care) tests are rapid tests – with results in minutes – but are lower reliability. Reliability here is a combination of both sensitivity (how reliably a person who has Covid-19 gets a positive result) and specificity (how reliably a person who does not have it gets a negative result). qPCR tests are more expensive, consuming scarcer resources than POC tests do.

The NFL protocols, in general, appear to be oriented around qPCR tests, given every day except gamedays, with POC tests added for a team once an outbreak occurs in its locker room.

The Patriots have now had four players test positive for Covid-19: Cam Newton (reported on Saturday, October 3rd, positive result presumably from Friday), Stephon Gilmore and Bill Murray (reported Tuesday, October 6th, presumably from a test on that same day), and Byron Cowart (reported on Sunday, October 11th, presumably from a test on October 10th). That puts 4 days between successive positive tests (counting Gilmore and Murray together as one infection).

There are two interesting distributed systems principles that might be informative here. One is pipeline problems with polling, and the other is Time Of Check / Time Of Use errors. Let me explain each of those.

Imagine you have a pipeline of ten actions that happen one after another, each taking under a minute to accomplish. Mentally, you likely just thought “Well, I add up the time each one takes, and that tells me how long the pipeline will operate.” And, likely, you’d be wrong, because the mental model you used – of work being cleanly handed off from one task to another – doesn’t match how polling-based systems operate. In a polling-based system, tasks aren’t sitting and waiting for input. Rather, at a set interval, they look to see if input is waiting, and if so, execute their task (this is for simplicity; it’s a lot easier to write software that just does one thing, and then use existing software components to schedule the work, than to write software that has to deal with listening for input). If the software components each wake up every minute on the minute, then a ten-task pipeline takes about ten minutes (as long as each task takes less than one minute), with only the final task shaving any time off the total. Any task that becomes more than a minute long? It misses the next wake-up and adds another minute to the pipeline – so a ten-task pipeline of 0:59 tasks takes ten minutes, while a ten-task pipeline of 1:01 tasks takes … twenty minutes.
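The arithmetic above can be checked with a minimal sketch. The function name and parameters are mine, and it assumes the first input arrives exactly on a wake-up tick, that every stage wakes on every multiple of the polling interval, and that a stage can pick up work on the same tick its predecessor finishes.

```python
def pipeline_seconds(task_seconds, num_tasks=10, poll_interval=60):
    """Wall-clock seconds for a pipeline whose stages poll for input.

    Assumption: input for the first stage is ready at t=0, exactly on a tick.
    """
    t = 0  # moment at which input for the next stage becomes available
    for _ in range(num_tasks):
        # A stage only notices its input at the next tick at or after t.
        start = ((t + poll_interval - 1) // poll_interval) * poll_interval
        t = start + task_seconds
    return t


print(pipeline_seconds(59))  # 599 seconds, just under ten minutes
print(pipeline_seconds(61))  # 1141 seconds, just over nineteen minutes
```

Summing the task times alone would give 590 and 610 seconds; the two extra seconds per task nearly double the wall-clock time because every stage misses a wake-up.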

How does the polling problem affect the NFL? By only testing the players once a day, the NFL lengthens its infection detection latency. A player who becomes detectably infectious 1 minute after being tested won’t be tested again for about 24 hours, and that result takes another 10-12 hours to arrive, so the infection won’t be detected for 36 hours or so. Which brings us to our second interesting principle.
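As a rough sketch of where the 36 hours comes from: the delay is the wait until the next scheduled test plus the lag before that test's result comes back. The function below is hypothetical, and the 24-hour interval and 12-hour lag are assumptions taken from the upper end of the figures above.

```python
def detection_delay_hours(hours_after_test, test_interval_hours=24,
                          result_lag_hours=12):
    """Hours from becoming detectable until a positive result is reported.

    hours_after_test: how long after the most recent test the player
    became detectably infectious (assumed strictly greater than zero).
    """
    wait_for_next_test = test_interval_hours - (hours_after_test % test_interval_hours)
    return wait_for_next_test + result_lag_hours


one_minute = 1 / 60
print(detection_delay_hours(one_minute))          # just under 36 hours
print(detection_delay_hours(one_minute, 12, 12))  # just under 24 hours
```

Under these assumptions, testing twice a day trims the worst case by about twelve hours, but the result lag still sets a floor on how quickly an infection can be noticed.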

Time of Check / Time of Use (TOCTOU) errors are a class of problems that arise when a computer system is about to do a potentially dangerous operation. Before doing that dangerous thing, one safety precaution is to test whether the operation is safe to do, and only then take the action. Unfortunately, the unsafe condition that you’re testing for might come into being between when you test and when you execute, and that creates a problem. This affects the NFL because even if there weren’t latency issues in their testing, a negative result in the morning doesn’t guarantee that the player won’t become contagious an hour later.
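Here is a toy sketch of that check-then-act gap, using the locker room rather than a file system. The Player class and check_then_use function are made up for illustration, and the callback stands in for whatever happens in the world between the test and the practice.

```python
class Player:
    def __init__(self):
        self.contagious = False


def check_then_use(player, between_check_and_use):
    """Check once, then act on that stale answer (the TOCTOU pattern)."""
    cleared = not player.contagious   # time of check: the morning test
    between_check_and_use(player)     # the world keeps moving
    if not cleared:
        return "held out"
    # Time of use: the decision relies on the earlier check, not current state.
    if player.contagious:
        return "practiced while contagious"
    return "practiced safely"


def nothing_happens(player):
    pass


def becomes_contagious(player):
    player.contagious = True


print(check_then_use(Player(), nothing_happens))     # practiced safely
print(check_then_use(Player(), becomes_contagious))  # practiced while contagious
```

The check itself was correct both times; the failure comes from acting on an answer that stopped being true before it was used.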

Based on the test results, it appears that there is a cycle of around four days between infections. Let’s assume for a moment that the Patriots players infected each other (that is, Newton infected Gilmore and Murray, and Gilmore (or Murray) infected Cowart). Let’s try to pin dates on those. Working backwards, since Gilmore tested positive on October 6th, let’s posit that he passed on Covid-19 to Cowart on October 5th, which means it took 5 days for Cowart’s infection to be detected. If Cowart passed it to someone yesterday, those 5 days mean we’d expect to see another positive result around Thursday, October 15th. Why did I add an extra day? Because while we might assume that Newton passed on the infection either the day before his test or the day of, and Gilmore the day before, the Patriots didn’t hold a practice the day before Cowart’s test.

But that target date of the 15th ignores that coronavirus incubation periods aren’t fixed. Past coronaviruses seem to have a two to ten day window, which means that to halt this outbreak, the Patriots should be isolated at least through the 20th (ten days after Cowart’s positive test).
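For anyone who wants to check the date arithmetic, here is a small sketch. The year is my assumption, inferred from the weekdays quoted above, and it treats “yesterday” as October 10th, the day of Cowart’s positive test.

```python
from datetime import date, timedelta

YEAR = 2020  # assumption: inferred from the weekdays quoted above

cowart_exposed = date(YEAR, 10, 5)    # posited: the day before Gilmore's positive
cowart_positive = date(YEAR, 10, 10)  # presumed test date

# Lag between the posited exposure and detection: 5 days.
detection_lag = cowart_positive - cowart_exposed

# Assumption: the last possible onward exposure was the day Cowart tested positive.
last_exposure = cowart_positive
next_expected = last_exposure + detection_lag
isolate_through = last_exposure + timedelta(days=10)  # upper end of the window

print(detection_lag.days)                       # 5
print(next_expected.strftime("%A, %B %d"))      # Thursday, October 15
print(isolate_through.strftime("%A, %B %d"))    # Tuesday, October 20
```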

There’s another problem with the NFL’s testing regime that also shows up in security compliance regimes: the NFL doesn’t seem to be looking hard enough. System owners often have a feeling of “if I didn’t know I had a problem, it’s not my fault that I’m not doing anything about it, so let’s not look.” The NFL should be testing every player as frequently as they can get away with. A POC and a qPCR test should be performed together every morning and every afternoon, so that an infectious player at a practice has a chance of being detected before the following day. Ideally, that afternoon test would be done 10-12 hours before the following day’s practice (or whatever latency the qPCR test results have), to maximize the benefit of a planned pipeline. Additional POC tests might be scattered throughout the day to reduce detection latency.

However, testing isn’t a substitute for a well-designed system. Symptomatic individuals have tested negative; asymptomatic individuals have tested positive. What’s essential is a willingness to react to even the hint of a problem in a well-controlled fashion, rather than being tied to an optimistic schedule of “play games every 7 days” – especially with an incubation period that so closely matches the playing schedule.

There are probably other changes the NFL should consider. Much has been made of the picture of Stephon Gilmore and Patrick Mahomes interacting with each other after their game. While we all enjoy the sportsmanship players show after the game, that should probably also be curtailed.