Too hot to handle
With the ever-increasing complexity of the software stacks running on our systems, we are starting to take the things that feed them, like power and cooling, for granted. Sure, by global standards the power grid in the Netherlands is one of the most reliable. On top of that, our primary data center has diesel generators as backup and a fully redundant power grid inside. To get the generated heat out, there’s a fully redundant cooling system in place.
So with all this power and cooling hardware in place, we’re protected against everything… right? Well, think again, because the power grid and air conditioning systems are also controlled by… software! A seemingly harmless software update to the ACUs inside one of our suites caused a control valve to respond in the opposite way to the commands its control software thought it was sending, effectively shutting down cooling and causing a temperature rise of 10 °C in a little over 30 minutes. Temperature rises like this ultimately cause hardware to shut down automatically. In this case, the problem was cleared before temperatures reached critical levels. If it hadn’t been, we would have been able to transparently fail everything over to a remote location, since the typical infrastructures we build are based on a twin data center active/active concept.
This shows once again that it doesn’t take the often-cited ‘plane crash’ to make the case for building mission-critical infrastructures, like our customers’, across multiple data centers. Actually, I don’t think there are any recorded events of an airplane crashing into a data center. Instead, something as mundane as the firmware controlling your ACUs can jeopardize all equipment inside a single room or even an entire data center. Plan for failure, and expect failure to come from unexpected sources.
All things considered, the twin data center active/active configuration is indeed too hot to handle!