Data Center Power Availability is More Than a Number

It can be difficult to appreciate the differences between data center power delivery architectures. Every data center provider you talk with has a power story to tell and most of them sound pretty good. The challenge is selecting the power architecture that is right for you.

One way to compare power delivery systems is to look at overall availability. Availability is usually expressed as a percentage of the total time the system is expected to be running. This is where the number of 9’s comes in. You might find availability percentages of three 9’s (99.9%), four 9’s (99.99%), etc.
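
To make those percentages concrete, here is a quick back-of-the-envelope calculation (a sketch, not any vendor's math) that converts a "number of 9's" into the downtime it permits per year:

```python
# Convert an availability percentage ("number of nines") into the
# downtime budget it allows over one year. Illustrative only.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def allowed_downtime_minutes(availability_pct: float) -> float:
    """Yearly downtime budget for a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% -> {allowed_downtime_minutes(pct):.1f} minutes/year")
```

Three 9's works out to roughly 525 minutes of allowed downtime a year, while five 9's allows barely 5 minutes — a useful reality check when comparing SLAs.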

While a number might make you feel better at night, it won’t necessarily keep the phone from ringing. I suggest you look for three words when evaluating data center power delivery systems: Redundant, Distributed, and Scalable.

Redundant – multiple independent systems

This is the "N" you hear so much about in data centers. Basically, N is the number of a given component that you need to deliver a service. For example, you need one power path from the utility to your server rack. This path could include multiple pieces of equipment, including a main switchboard, backup generator, UPS (uninterruptible power supply), and a PDU (power distribution unit). A second independent power path with all of those elements to the same server rack would be 2N. If there is a break in the first path, then the second path takes over. Ask your data center provider how they design for redundancy in all critical systems. The most fault-tolerant way to keep your system running is to have multiple N’s. The challenge is that too many N’s can be expensive to acquire and complicated to manage.
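
A rough availability model shows why that second path matters so much. A single path is a series chain — it is only up when every component in it is up — while two independent paths fail together only when both are down. The component availabilities below are illustrative assumptions, not measured figures:

```python
# Rough availability model: one power path (components in series)
# versus two independent paths (2N). All numbers are assumptions
# chosen for illustration only.

def path_availability(components: dict[str, float]) -> float:
    """A series chain is up only when every component is up."""
    a = 1.0
    for availability in components.values():
        a *= availability
    return a

path = {
    "main switchboard": 0.9999,
    "generator":        0.999,
    "UPS":              0.9995,
    "PDU":              0.9999,
}

a_1n = path_availability(path)   # one path (N)
a_2n = 1 - (1 - a_1n) ** 2       # two independent paths (2N)

print(f"N:  {a_1n:.6f}")
print(f"2N: {a_2n:.6f}")
```

Under these assumed numbers, a single path lands near three 9's, while doubling it pushes the figure past five 9's — the multiplicative payoff behind 2N designs.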

Distributed – a resource pool + backup(s)

This is the "+1" in an N+1 design or the "+2" in an N+2 design. When the costs of the redundant architecture become prohibitive, a distributed approach for critical elements of the infrastructure is a great way to improve overall system reliability. You take a device and set it up as a spare for the required pool of devices. Say you need five UPSs to run the pod and you have one additional UPS that can backup any of the five in the pool – that’s N+1. Two spares for the pool means N+2. The critical element of this configuration is the monitoring and management system that must recognize a device failure and automatically switch to the back-up device.
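
The monitoring-and-failover logic described above can be sketched in a few lines. This is a minimal toy model of an N+1 pool (device names are hypothetical), not any real DCIM system:

```python
# Toy model of an N+1 device pool: N active devices plus spares.
# When the monitor sees a failure, it promotes a spare so the pool
# keeps meeting the required load.

class DevicePool:
    def __init__(self, active: list[str], spares: list[str]):
        self.active = set(active)
        self.spares = list(spares)
        self.required = len(active)  # this is "N"

    def report_failure(self, device: str):
        """Remove a failed device and promote a spare if one remains."""
        self.active.discard(device)
        if self.spares:
            replacement = self.spares.pop(0)
            self.active.add(replacement)
            return replacement
        return None  # no spares left: pool is degraded below N

    @property
    def healthy(self) -> bool:
        return len(self.active) >= self.required

# Five UPSs run the pod, one spare backs the pool: that's N+1.
pool = DevicePool(active=["UPS-1", "UPS-2", "UPS-3", "UPS-4", "UPS-5"],
                  spares=["UPS-6"])
pool.report_failure("UPS-3")  # the spare takes over automatically
print(pool.healthy)           # prints True
```

A second failure with no spares remaining would drop the pool below N — which is exactly the scenario the next sections argue N+1 cannot survive.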

Scalable – engineered for growth and change

Data centers are continually growing and changing all at the same time. To deliver superior service at an affordable price, the data center should be built out based on usage. The shell may be in place day 1, but the power and cooling should be purchased and installed as customers move in. Otherwise you are paying too much, too soon. Also, within the IT cages, servers, storage devices, and network gear are being added, removed, upgraded, and relocated. Your data center needs to be engineered for both growth and change. Power and cooling systems should accept additional devices as capacity requirements grow. Live IT power load should be dynamically shared or moved across the entire facility. All of this must occur without an outage.

At RagingWire, we’ve coined a name for our redundant, distributed, and scalable power delivery architecture – we call it 2N+2. We have two patents on the technology and offer a 100% Availability SLA (service level agreement) with these configurations.

How can we be so confident that with our 2N+2 architecture your power will not go down? We have 2N redundancy on the power paths to your cage or rack and an N+2 distributed design for the critical elements in the power delivery system. Lastly, one of our patented inventions is a unique cross-facility power switching fabric and a massively scalable topology that allows us to move, share, and scale live IT power load throughout the data center without requiring a maintenance outage.

For data centers, the old adage definitely still applies: "You can’t manage what you don’t measure." Availability numbers are a great metric to manage your data center power delivery system. However, when choosing the right data center colocation solution, be sure to look for power delivery systems that are redundant, distributed, and scalable.

When N+1 just isn’t good enough

2006 was a pivotal year for RagingWire. 2006 was the year RagingWire learned that for data centers, N+1 just isn't good enough. 2006 is the year RagingWire went dark. It started normally enough – a beautiful spring day in April. During normal operations, a 4,000-amp breaker failed. Material failures happen, even with the best maintenance programs in place. Our UPSs took the load while the generators started – then the generators overloaded. The data center went dark.

After bringing the data center back online, we performed a detailed post-mortem review and identified the root causes of the outage to be design flaws and human error. Our management team declared that this could never, ever happen again. We knew that we needed to invest heavily in our people, and that we needed to rethink how data centers operate. We started with investing in our people because human error can overwhelm even the best of infrastructure designs. We focused our recruitment efforts in the nuclear energy industry and the navy nuclear engineering program – both working environments where downtime is not an option and process control, including operations and maintenance, is second nature. We hired a talented team and asked them to design and operate our data center to run like a nuclear sub.

Our revamped team of engineers determined that the then-current N+1 design did not meet their requirements, so they changed it and implemented the concept of a 2N+2 design. Their work was recognized last week as RagingWire announced the issuance of Patent #8,212,401 for “redundant isolation and bypass of critical power equipment.” This is one of two patents that resulted from RagingWire’s outage in 2006 and our efforts to design a system that would never go down again.

RagingWire’s systems are built to a 2N+2 standard. RagingWire exceeds the Uptime Tier IV standard by providing fault tolerance during maintenance. We call this “fix one, break one” or FOBO. This means that any active component – UPS, generator, chiller, pump, fan, switchboard, etc. – can be removed from service for maintenance, any other active component can fail, AND we can experience a utility outage, all without loss of power or cooling to the server rack. Having this extra level of redundancy allows RagingWire to perform more maintenance, and to do so without worrying about a loss in availability. This enables us to provide a 100% uptime SLA, even during maintenance windows.
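
The FOBO capacity math reduces to a simple check: after taking one unit out for maintenance and losing another to a fault, do the survivors still cover the required load? Here is a deliberately simplified sketch (unit counts are illustrative; the concurrent utility outage is assumed to be absorbed by the generator plant and is not modeled):

```python
# Toy check of the "fix one, break one" (FOBO) standard: can a pool
# tolerate one unit in maintenance AND one simultaneous fault while
# still covering the required load? Numbers are illustrative.

def survives_fobo(total_units: int, required_units: int) -> bool:
    """True if the pool rides through 1 maintenance + 1 fault at once."""
    in_maintenance = 1
    faulted = 1
    return total_units - in_maintenance - faulted >= required_units

print(survives_fobo(total_units=6, required_units=5))  # N+1 -> False
print(survives_fobo(total_units=7, required_units=5))  # N+2 -> True
```

An N+1 pool fails the FOBO test the moment maintenance begins, which is why the design standard above calls for N+2 on critical components.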

Looking at the last year and a half, it’s clear that many data centers are still providing their customers an inferior N+1 design. How do you know? Simply look at the number of providers below who have suffered data center outages over the past 18 months. Since 2006, RagingWire has had 100% availability of its power and cooling infrastructure due to its superior 2N+2 design. If your current provider is still offering N+1, maybe it’s time to ask yourself if N+1 is still good enough for you.

October 22, 2012 – Amazon Web Services suffered an outage in one of its data centers that took down multiple customers in its US-East-1 region. The problem was attributed to a “small number” of storage volumes that were degraded or failed.

October 8, 2012 – A cable cut took down Alaska Airlines’ ticketing and reservation system, causing delays across the airline’s operations and preventing customers from checking in for flights.

August 7, 2012 – A fiber cut took nonprofit Wikipedia offline for an hour.

July 28, 2012 – 1,100 customers were powered off due to human error during preventative maintenance on a UPS in a Newark, DE data center.

July 10, 2012 – Level3’s East London data center went offline for 5 hours after a UPS bus bar failed.

July 10, 2012 – A worldwide outage followed a power failure in one of Equinix’s Silicon Valley data centers.

June 29, 2012 – Amazon Web Services suffered a power outage in its Northern Virginia data center. Multiple generators failed to start automatically due to synchronization issues and had to be started manually.

June 14, 2012 – Amazon Web Services suffered a power outage in its Northern Virginia data center. The problem was blamed on a defective generator cooling fan and a misconfigured power breaker.

June 13, 2012 – US Airways had a nationwide disruption of its computer system, affecting reservations, check-in, and flight status due to a power outage at its AT&T data center in Phoenix.

January 20, 2012 – A power failure in Equinix’s SV4 data center took several customers, including Zoho, offline.

October 10, 2011 – Research In Motion cut BlackBerry service to most of Europe for 6 hours due to a power failure in its Slough, UK data center. The outage caused service disruptions worldwide for 3 days.

August 11, 2011 – An automatic transfer switch failed at Colo4 in Dallas, TX, resulting in a 6-hour power outage.

August 7, 2011 – Amazon Web Services’ Dublin, Ireland data center lost power due to a generator phase-synchronization error, disrupting service to the EU West region.

The Power of N

If you have been in or around data centers over the last 10 years, you have experienced the power of N. This single letter drives the architectural standards and design philosophies of the entire data center industry. There are a lot of N’s in the data center industry -- N, 2N, N+1, N+2, and (2(N+1)).

Now RagingWire is introducing a new N called 2N+2. Why are we doing this? Well, the other N’s didn’t measure up to the task of describing our patented critical infrastructure architecture.

What is N?
N is the amount of something you need in order to deliver a service or load. For an IT shop, N could be the number of servers you need to deliver a defined processing capacity. In a data center, N could be the number of UPSs (uninterruptible power supplies), generators, or MSBs (main switchboards) you need to deliver a power load. Of course, in an N configuration, you need to hold constant the capacity of each element that makes up the N.

With N as your base, the next step is to identify the number of spare devices and complete backup units in your configuration. For example, let’s say you need 10 servers to run a cloud application. If you have a total of 14 interconnected servers with 10 production devices and four spare units, then you have an N+4 design. If you have two independent configurations of 10 servers each that can back each other up, then you have a 2N design.
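
The naming convention in the example above can be captured in a small helper. This is just a sketch of the article's definitions (the function name is my own): spares added to one production pool give N+k, while a full independent copy of the production set gives 2N:

```python
# Name a redundancy configuration from device counts, following the
# definitions in the text: N production devices, spares give N+k,
# and full independent copies of the production set give 2N, 3N, ...

def label_configuration(required: int, total: int,
                        independent_sets: int = 1) -> str:
    """Return the N-notation label for a device configuration."""
    if independent_sets > 1 and total == required * independent_sets:
        return f"{independent_sets}N"
    spares = total - required
    return "N" if spares == 0 else f"N+{spares}"

# 10 servers required, 14 deployed in one pool -> N+4
print(label_configuration(required=10, total=14))
# Two independent sets of 10 that back each other up -> 2N
print(label_configuration(required=10, total=20, independent_sets=2))
```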

N is a useful approach when describing the world of physical devices needed to deliver a certain capacity. The challenge is that information technology and data centers are becoming increasingly virtualized, with pools of capacity that are available and dynamically configurable. Devices still matter, but so does continuous monitoring and dynamic management of the capacity those devices deliver.

2N+2 Delivers 100% Availability
RagingWire’s patented 2N+2 design describes the physical devices, virtualized capacity, and PLCs (programmable logic controls) that enable us to deliver a data center with 100% availability even during maintenance periods and a utility outage.

We call the PLCs and integrated data center infrastructure management (DCIM) system N-Matrix™. With N-Matrix, we can combine our 2N power paths and N+2 critical infrastructure to deliver a 2N+2 data center – the most reliable data center design in the world.

Technology is great, but it’s all about the people

Often, when we take potential customers through our data centers and show them our patented technology, they remark on what incredible technology we have designed and implemented. My first response is always this: it is a result of the people we hire to design, build, and operate our data centers. My two priorities in anything we do are availability of the customer application and outstanding customer service. These are enabled by technology, but driven by people. As demonstrated by numerous studies in the data center industry and from my previous life in the nuclear industry, people remain the leading cause of downtime in data centers (more on that in follow-on posts).

First, hire the right people and then give them the tools to succeed. One of the best things RagingWire has done is give our employees and our clients a clear definition of our data center design: "Fix one, break one, concurrent with a utility outage." In other words: we are designed for concurrent maintainability and fault tolerance during a utility outage, including power or water -- this philosophy resonates through RagingWire's design, construction, and operations groups, and even in our concurrent engineering sessions with our clients. The philosophy is driven by the people we have at RagingWire.

Many people in the industry have tried to treat the data center as commoditized real estate. It is unequivocally not real estate; it is a product which at the end of the day delivers availability of a service and an application to our customers. As people try to commoditize and treat data centers as real estate, they lose focus on availability and product delivery and therefore they outsource design, construction, and operations - driving down service and quality. Data centers, the product that we provide, and the availability of the service are not a commodity that can easily be whitewashed between providers. There is an amazing amount of technology and innovation being put into our data centers, and the product is backed up by incredible people dedicated to the availability and uptime of that product.

RagingWire has made a conscious decision to hire and in-source the life cycle of the data center. We design what we build, we build what we design and we operate what we design and build. And we provide these resources to our customers to ensure that when they build out, their IT environment is as hardened and redundant as possible and that their hardware, network and application level architecture is designed in conjunction with our data center design. The people, enabled by the technology, are the cornerstone of how we accomplish this with our clients and provide 100% availability of their applications and services.

Whenever we search for potential technology vendors, RagingWire always interviews the provider’s team and makes an evaluation of the people behind the product. You can take the greatest technology in the world, place it in the wrong hands and end up with a product that no one wants. Similarly, the right people can make all of the difference, especially when given incredible technology and tools.

The next time you go to your data center, evaluate the technology, how they do business, and their availability record. Just as important, evaluate who is behind the product and the people that are ultimately going to be ensuring your critical application availability.

Why the City of Seattle’s Data Center Outage Last Weekend Matters

There are a lot of things that the City of Seattle did right in their management of last weekend’s data center outage to fix a faulty electrical bus bar. They kept their customers – the public – well informed as to the outage schedule, the applications affected, and the applications that were still online. Their critical facilities team completed the maintenance earlier than expected, and they kept their critical applications online (911 services, traffic signals, etc.) throughout the maintenance period.

Seattle’s mayor, Mike McGinn, acknowledged in a press conference last week on the outage that the city’s current data center facility “is not the most reliable way for us to support the city’s operations.” Are you looking for a data center provider, especially one where you’ll never have to go on record with that statement? If so, here are a few take-aways:

A failure to plan is a plan to fail. While the city of Seattle planned to keep their emergency and safety services online, had they truly planned for the worst? I’m sure they had a backup plan if the maintenance took a turn for the worse, but did they consider the following: what if a second equipment fault occurs? Traditionally, the “uptime” of an application is defined as the amount of time that there is a live power feed provided to the equipment running that application. I would offer a new definition of “uptime” for mission critical applications: the time during which both a live power feed and an online, ready-to-failover redundant source of power is available to ensure zero interruptions. “Maintenance window” shouldn’t be part of your mission critical vocabulary. Which brings me to my next point . . .

Concurrent maintainability and infrastructure redundancy is key. I will go one step further – concurrent maintainability AND fault tolerance are key factors in keeping your IT applications online. The requirement to perform maintenance and sustain an equipment fault at the same time isn’t paranoia – it’s sound planning. Besides, a little paranoia is a good thing when we’re talking about applications like 911 services, payment processing applications, or other business-critical applications.

Location. Location. Location. The city of Seattle’s data center is located on the 26th story of a towering office building in downtown Seattle. The fact that they had to take down multiple applications in order to perform this maintenance implies that the electrical feed redundancy to their data center is somewhat limited. There are many competing factors in choosing a data center location: electrical feed redundancy, connectivity options, and natural hazard risk profile, to name a few. For mission critical applications, your location choice has to center on factors that will keep your systems online 100% of the time.

Flexibility and scalability give your IT department room to breathe. The city of Seattle leased out their single-floor data center space before the information economy really took hold. As a result, their solution is relatively inflexible when it comes to the allowable power density of their equipment. They’re quickly outgrowing their space and already looking for an alternate solution. Look for a data center provider that focuses on planning for high-paced increases in rack power draw – do they already have a plan for future cooling capacity? How much power has the facility contracted from the local utility?
