Failing Up: Stronger Data Centers through Incident Management
In the critical facilities industry, incidents are typically given a bad rap. Executives and operators view incidents – events that affect the redundancy of the data center – as bad business. So winning an award for managing incidents would seem like being recognized for your ability to bail water rather than build a sound boat. But to the right company, incidents aren’t a measure of failure; they’re challenges that improve your business process. The upside to incidents is the ability to learn from them and, more importantly, the opportunity to share those lessons with others, both internally and throughout our industry.
Bob Wichert and TJ Ciccone from RagingWire receiving the 2014 Uptime Institute Incident Management Award
At the Uptime Institute’s Critical Facilities Summit in Charlotte, NC on October 5, 2015, RagingWire Data Centers received the 2014 Uptime Incident Management Award. This award was presented in recognition of achievement in tracking and responding to – not avoiding – incidents in data center infrastructure (as determined by incident contributions to the Uptime Institute Network Abnormal Incident Report (AIR) Database).
In simplest terms, this does not mean that RagingWire experienced the most incidents. It means that, as a company, we successfully capitalized on them, helping to spread knowledge to other members of the organization. So much so, in fact, that we submitted more than three times as many lessons learned as our nearest competitors.
What does it take to win this award? It takes an operational commitment to sharing data regarding incidents at your facility and implementing changes to prevent them in the future. It is a humbling, but rewarding, task. By being active participants in the AIR database, we have been able to collectively gather statistical data that has helped shape our data center world today.
When you need to build a case for 24/7 staffing, you can access the database and track the percentage of incidents that occur during non-peak hours. If you think your site is incurring an abnormal number of faults on a piece of equipment, or an unusually low Mean Time Between Failures (MTBF), you can turn to the database and search for others who may be experiencing the same issue. Wondering if a new type of cooling solution would be a good fit? The shared data can help you make a more informed decision.
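To make those two checks concrete, here is a minimal Python sketch of how an operator might compute them from an exported incident list. Everything in it is an assumption for illustration: the sample records, the equipment labels, the 08:00 to 18:00 definition of peak hours, and the function names are invented, and none of it reflects the actual format or interface of the AIR database. It also approximates MTBF as the average gap between consecutive failure timestamps, ignoring repair time and weekends.

```python
from datetime import datetime

# Hypothetical incident records: (timestamp, equipment label).
# These values are invented for illustration only.
incidents = [
    (datetime(2015, 1, 3, 2, 15), "CRAC-1"),
    (datetime(2015, 1, 9, 14, 40), "UPS-A"),
    (datetime(2015, 2, 1, 23, 5), "CRAC-1"),
    (datetime(2015, 2, 20, 10, 30), "CRAC-1"),
]


def off_peak_share(records, peak_start=8, peak_end=18):
    """Percentage of incidents whose hour falls outside [peak_start, peak_end)."""
    if not records:
        return 0.0
    off_peak = sum(
        1 for ts, _ in records if not (peak_start <= ts.hour < peak_end)
    )
    return 100.0 * off_peak / len(records)


def mtbf_hours(records, equipment):
    """Average gap, in hours, between consecutive failures of one unit.

    Returns None when fewer than two failures are recorded, because no
    gap can be measured. Repair time is ignored (a simplification).
    """
    times = sorted(ts for ts, eq in records if eq == equipment)
    if len(times) < 2:
        return None
    gaps = [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps)


if __name__ == "__main__":
    print(f"Off-peak incidents: {off_peak_share(incidents):.1f}%")
    print(f"CRAC-1 MTBF (hours): {mtbf_hours(incidents, 'CRAC-1')}")
```

A high off-peak percentage supports the staffing argument, while an MTBF figure well below what peers report for similar equipment is the signal to go looking for others with the same problem.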
We all have incidents; let’s just admit that together. They are an unavoidable side effect of what we do, and certainly a smart data center strives not to make the same mistake twice. But what defines your business is not the ability to never have an incident – which would require some tricky bookkeeping and diligent rug-sweeping – but the ability to learn from incidents and come out stronger as a company, and ultimately as an industry.