Data Center Emergency Preparedness: When too much is not enough.

by Chris Thames
01 November 2012

One of the most important aspects of data center operations is risk management, or risk mitigation. Data center operators typically work with a proactive mentality so that they can react properly to any given situation, which ultimately reduces the risk exposure of the facility. Training, preventive maintenance, and regular system and equipment testing become second nature, as these facilities are expected to operate seamlessly 24x7x365 and, for the most part, do; however, it’s the once-in-a-while event that tests the true resiliency of the facility and pushes the operations staff to their limits.

An acute attention to detail and complete ownership of the facility are common characteristics of our operations staff. The team works tirelessly to ensure that they are ready for any given scenario. Emergency preparedness checklists are created, inventories are taken, and procedures are written for the most common scenarios and for the events most likely to take place within the facility; however, we often find that even with all of our efforts dedicated to ensuring our preparedness… it’s not enough.

Recent events in the Northeast are bringing to light scenarios that an operations team may not be prepared to handle while maintaining the continuous uptime of the critical mission. Real-world examples include: loss of generator fuel delivery requiring re-priming of the fuel system, emergency redistribution of proprietary electrical feeds at the rack level, unusual roof leaks, flooding, staffing relief plans, and communications challenges. When creating emergency operating procedures or casualty control procedures, emphasis must be placed on scenarios in which the staff must react on their own, because external help is not available.

One very real scenario that we should proactively run drills on is generator loss of prime: how to re-prime the engines after fuel flow issues or after changing the fuel pre-filters. Many of us do not have paralleled pre-filter assemblies, which means that once fuel pressure starts dropping due to extended run time, the fuel pre-filters will have to be changed, which increases the chance of a loss of fuel prime. Onsite staff must be able to change the fuel filters and must be trained to re-prime the engines, particularly when help is not on the way. The only way to achieve this is to proactively train each member of the operations team through a hands-on approach. Along those same lines, true drills need to be run within the facility, and a critique should be held after each drill to analyze how the staff performed and to continually improve existing processes and procedures.

Remember, data center operation and uptime are just as important as those of a continually operating nuclear power plant. Here at RagingWire, we employ many former nuclear power operators from the Navy and the civilian sector, who couldn’t be better examples of critical facilities operators. We each bring to the table a dedicated critical-facilities mentality, which is seen not only in our data center infrastructure design and operation but also in the way that we work together within the data center community. As we all continue to recover from the devastating conditions experienced over the last several days, RagingWire is providing unwavering support to our colleagues who continue to operate in these adverse conditions.

Impact of Hurricane Sandy on the US East Coast

RagingWire’s Northern Virginia data center campus, located in Ashburn, sustained no damage from the hurricane and remained on utility power for the duration of Hurricane Sandy’s assault on the East Coast. Our thoughts are with those who continue to recover from the storm and the damage it caused.

Chris Thames

Sr. Director of Critical Facilities Operations