When N+1 just isn’t good enough

Bill Dougherty

2006 was a pivotal year for RagingWire. 2006 was the year RagingWire learned that for data centers, N+1 just isn't good enough. 2006 was the year RagingWire went dark. It started normally enough – a beautiful spring day in April. During normal operations, a 4,000-amp breaker failed. Material failures happen, even with the best maintenance programs in place. Our UPSs took the load while the generators started – then the generators overloaded. The data center went dark.

After bringing the data center back online, we performed a detailed post-mortem review and identified the root causes of the outage to be design flaws and human error. Our management team declared that this could never, ever happen again. We knew that we needed to invest heavily in our people, and that we needed to rethink how data centers operate. We started by investing in our people because human error can overwhelm even the best infrastructure designs. We focused our recruitment efforts on the nuclear energy industry and the Navy nuclear engineering program – both working environments where downtime is not an option and process control, including operations and maintenance, is second nature. We hired a talented team and asked them to design and operate our data center to run like a nuclear sub.

Our revamped team of engineers determined that the then-current N+1 design did not meet their requirements, so they changed it and implemented the concept of a 2N+2 design. Their work was recognized last week as RagingWire announced the issuance of Patent #8,212,401 for “redundant isolation and bypass of critical power equipment.” This is one of two patents that resulted from RagingWire’s outage in 2006 and our efforts to design a system that would never go down again.

RagingWire’s systems are built to a 2N+2 standard. RagingWire exceeds the Uptime Tier IV standard by providing fault tolerance during maintenance. We call this “fix one, break one” or FOBO. This means that any active component – UPS, generator, chiller, pump, fan, switchboard, etc. – can be removed from service for maintenance, any other active component can fail, AND we can experience a utility outage, all without loss of power or cooling to the server rack. Having this extra level of redundancy allows RagingWire to perform more maintenance, and to do so without worrying about a loss in availability. This enables us to provide a 100% uptime SLA, even during maintenance windows.
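To make the “fix one, break one” arithmetic concrete, here is a minimal sketch in Python. It is an illustration only, not RagingWire’s design: it assumes identical, interchangeable units and simply counts them, ignoring the power-path topology that a real facility depends on. The function names and the value N = 4 are hypothetical.

```python
def surviving_units(installed: int, in_maintenance: int = 0, failed: int = 0) -> int:
    """Units still available after maintenance removals and failures."""
    return max(installed - in_maintenance - failed, 0)


def carries_load(n_required: int, installed: int,
                 in_maintenance: int = 0, failed: int = 0) -> bool:
    """True if the remaining units can still carry the full load.

    Simplifying assumption: every unit is interchangeable, so only the
    count matters, not which power path a unit sits on.
    """
    return surviving_units(installed, in_maintenance, failed) >= n_required


N = 4  # hypothetical: units needed to carry the full load

designs = {
    "N+1": N + 1,
    "2N+2": 2 * N + 2,
}

# FOBO scenario: one unit out for maintenance, one other unit fails.
for name, installed in designs.items():
    ok = carries_load(N, installed, in_maintenance=1, failed=1)
    print(f"{name}: {installed} installed, load carried during FOBO: {ok}")
```

Under these assumptions, an N+1 design drops to N-1 units in the FOBO scenario and cannot carry the load, while a 2N+2 design still has 2N units available.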

Looking at the last year and a half, it’s clear that many data centers are still providing their customers an inferior N+1 design. How do you know? Simply look at the number of providers below who have suffered data center outages over the past 18 months. Since 2006, RagingWire has had 100% availability of its power and cooling infrastructure due to its superior 2N+2 design. If your current provider is still offering N+1, maybe it’s time to ask yourself if N+1 is still good enough for you.

October 22, 2012 – Amazon Web Services suffered an outage in one of its data centers that took down multiple customers in its US-East-1 region. The problem was attributed to a “small number” of storage volumes that were degraded or failed.

October 8, 2012 – A cable cut took down Alaska Airlines’ ticketing and reservation system, causing delays across the airline’s operations and preventing customers from checking in for flights.

August 7, 2012 – A fiber cut takes nonprofit Wikipedia offline for an hour.

July 28, 2012 – Hosting.com powered off 1,100 customers due to human error during preventative maintenance on a UPS in their Newark, DE data center.

July 10, 2012 – Level3’s East London data center went offline for 5 hours after a UPS bus-bar failed.

July 10, 2012 – Salesforce.com suffers worldwide outage after a power failure in one of Equinix’s Silicon Valley data centers.

June 29, 2012 – Amazon Web Services suffers a power outage in its Northern Virginia data center. Multiple generators failed to start automatically due to synchronization issues, and had to be started manually.

June 14, 2012 – Amazon Web Services suffers a power outage in its Northern Virginia data center. The problem was blamed on a defective generator cooling fan and a misconfigured power breaker.

June 13, 2012 – US Airways had a nationwide disruption of their computer system, affecting reservations, check-in and flight status due to a power outage at their AT&T data center in Phoenix.

January 20, 2012 – A power failure in Equinix SV4 data center took several customers including Zoho offline.

October 10, 2011 – Research in Motion cut BlackBerry service to most of Europe for 6 hours due to a power failure in their Slough, UK data center. The outage caused service disruptions for 3 days worldwide.

August 11, 2011 – An automatic transfer switch failed at Colo4 in Dallas, TX, resulting in a 6-hour power outage.

August 7, 2011 – Amazon Web Services’ Dublin, Ireland data center lost power due to a generator phase-synchronization error, disrupting service to the EU West region.
