Environmental Controls at Planetary Scale

A common set of security control objectives found in standard frameworks (ISO 27002, FedRAMP, et al.) focuses on environmental controls. These controls, which might cover humidity sensors and fire suppression, are designed to maximize the mean time between critical failures (MTBCF) of the systems inside a data center. They are often about reliability, not safety[1]: they fixate on over-engineering a small set of systems rather than building in fault tolerance.

Is the cost worth the hassle? If you run one data center, the cost might be worthwhile: after all, it’s only a few capital systems, and an improvement of a few basis points in MTBCF will likely be worth that hassle (both the operational false positives and the deployment cost). But what if you operate in thousands of data centers, most of them someone else’s? The cost multiplies, but the marginal benefit shrinks, because any given data center improvement affects only a small portion of your systems. Each data center in a planetary-scale environment is now about as critical to availability as a power strip is to a single data center. Mustering an argument to monitor every power strip would be challenging; a better approach is to keep a drawer full of power strips and replace the ones that fail.
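To make the arithmetic concrete, here is a rough sketch in Python. It assumes load is spread evenly across sites, so improving one site touches only 1/N of the fleet, and that the cost of a control scales linearly with the number of sites. The function names and the example numbers (a 5 basis point gain, a 10,000 per-site cost) are hypothetical, chosen only for illustration.

```python
def marginal_fleet_gain(num_sites: int, per_site_gain: float) -> float:
    """Fleet-wide availability gain from improving ONE site.

    Assumption: load is evenly spread, so one site is 1/num_sites of the fleet.
    """
    if num_sites < 1:
        raise ValueError("num_sites must be at least 1")
    return per_site_gain / num_sites


def rollout_cost(num_sites: int, cost_per_site: float) -> float:
    """Cost of deploying the control everywhere, assuming linear scaling."""
    if num_sites < 1:
        raise ValueError("num_sites must be at least 1")
    return num_sites * cost_per_site


if __name__ == "__main__":
    gain = 0.0005   # hypothetical: 5 basis points of availability per site
    cost = 10_000   # hypothetical: per-site deployment cost
    for sites in (1, 2000):
        print(sites, marginal_fleet_gain(sites, gain), rollout_cost(sites, cost))
```

With one site, the whole gain lands on the whole fleet. With 2,000 sites, each site's improvement moves the fleet by a tiny fraction of that gain while the total bill grows 2,000-fold.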

The same model applies at planetary scale. With thousands of data centers all over the world (most of whose operators already have their own incentives to take care of environmental monitoring), a much more effective approach is to keep focusing on regional failover, since data centers, metro regions, and countries go offline all the time, and to worry about issues within a data center only when they become a noticeable problem.
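A back-of-the-envelope way to see why failover dominates is sketched below. It assumes site failures are independent, which real outages often are not (a regional event can take out several sites at once), so treat the result as an optimistic bound. The function name and the sample availability figure are hypothetical.

```python
def all_sites_down_probability(site_availability: float, num_sites: int) -> float:
    """Probability that every one of num_sites failover candidates is down at once.

    Assumption: failures are independent and each site has the same availability.
    """
    if not 0.0 <= site_availability <= 1.0:
        raise ValueError("site_availability must be between 0 and 1")
    if num_sites < 1:
        raise ValueError("num_sites must be at least 1")
    return (1.0 - site_availability) ** num_sites


if __name__ == "__main__":
    # Hypothetical: mediocre sites that are each up only 99% of the time.
    for sites in (1, 2, 3):
        print(sites, all_sites_down_probability(0.99, sites))
```

Under these assumptions, each additional failover site cuts the chance of a total outage by two orders of magnitude, far more than a few basis points of per-site hardening would.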

1 Leveson, Nancy. Section 2.1, “Confusing Safety with Reliability”, Engineering a Safer World, pp. 7-14.

crossposted at blogs.akamai.com