Bill Dougherty's blog

Earthquakes and Bay Area Data Centers: It’s Not If, but When

It’s been a long time since we’ve had a severe earthquake in the Bay Area, but today a 6.1 magnitude earthquake struck 6 miles southwest of Napa. If you’ve never experienced an earthquake, trust me, a 6.1 is big and scary! We live in Napa, and our whole house was shaking at 3:20 AM!

As I helped friends and family clean up today, I had a few thoughts to share with you. On a personal level, I’m thankful everyone is safe and accounted for. This earthquake had the potential to be much worse. Because the quake hit early in the morning, most people were home and asleep. Fortunately, the older buildings that were damaged were mostly unoccupied. All that we lost was stuff, and in the end, stuff doesn’t matter that much.

Bay Area Data Centers and Earthquake Risks

From a work perspective, it was a good reminder of why RagingWire treats natural disaster risk as a primary selection criterion when building our data centers. We call our Sacramento data center campus "The ROCK" for a reason: it’s built on bedrock and sits far from the earthquake risk zones of Northern California. Even though we’re within driving distance of San Francisco (90 miles) and San Jose (120 miles), we are a world apart when it comes to natural disaster risk.

The last major earthquake in the Bay Area was the Loma Prieta quake in 1989, a magnitude 6.9 shaker that caused part of the Bay Bridge to collapse and interrupted the World Series. Back then, like today, Sacramento was unaffected, because Sacramento sits far from the Bay Area’s active fault zones and has essentially no earthquake risk.

In the 25 years since Loma Prieta, many data centers have been built in the Bay Area. Memories are short, especially for IT people who weren’t here at the time. The Bay Area is a great place to live and work, but it isn’t an ideal place to put your critical IT infrastructure.

Remember, even if the data center building survives a major quake, the surrounding infrastructure is not resilient. Bridges, roads, power grids, fiber paths, and fuel suppliers are all vulnerable and have a direct impact on your operations and service availability. And there’s no question, another quake will hit the Bay Area.

It’s not a matter of IF, but WHEN.


Is Customer Service a Dying Art?

At RagingWire, providing superior customer service is part of our DNA. But lately, dealing with a lot of other vendors, I have to ask, “Is customer service a dying art?”

As an example, I recently moved to a new house. The telephone company took three weeks to move my phone. My work order was screwed up in their system (programming glitch). It took me calling them daily for weeks, getting transferred 5 to 13 times per call, and an unimaginable number of “escalations” to resolve. Finally the “Retention Department,” the one department charged with preventing customers from quitting the phone company, figured out the solution. The rep just typed a new work order and everything started working. Needless to say, I switched companies soon after.

In another situation, I contacted a company because they miscalculated sales tax on an order. They charged tax on a service fee, which is a non-taxable item. Rather than resolving the issue and giving me some assurance that they’d fix their problem, the service rep gave me a canned response. “Our system charges tax on everything.” “Even though that’s illegal?” “Yep.” I hope for their sake that a more litigious person doesn’t notice the error.

My point here is not to complain about bad service, but to point out that good service is increasingly rare. The problem is that too many companies are trying to squeeze too many pennies by putting up walls between themselves and the customer. Outsourced call centers, automated phone trees, and refusals to hand out information do not make for a good customer experience.

Our customers hopefully experience something different. Every time you call RagingWire, a real live person answers the phone. Every time – 24x7. And that person’s job is to solve your issue. Or get you to the person who can solve that problem.

And we measure our results. Every time. Every single service ticket is followed up with a survey so our customers can tell us what we did well and where we can improve. And every single survey response is reviewed by management.

Your relationship with your data center is a long term one. It’s expensive and disruptive to switch vendors. Some data center providers take advantage of this, because they believe their customers are trapped. At RagingWire, we value our customers. We consider it an obligation to offer exceptional service so our customers never feel trapped. Ever.

RagingWire Net Promoter Score (NPS) – Aug 2014

Every quarter we also ask every one of our customers how we are doing, via a Net Promoter Score (NPS) survey. NPS is the gold standard for measuring customer experience, used by some of the world’s best service organizations including USAA, Costco, Apple, and Nordstrom. The average NPS across all companies is +23 on a scale of -100 to +100. A +50 is considered outstanding.

In our most recent quarter, RagingWire earned an NPS of +62, the top score in the data center industry. By the way, we’ve had an NPS above +60 for four quarters in a row, and we are committed to maintaining our leadership position in the industry.
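For readers who haven’t run an NPS program, the mechanics are simple: customers answer a single 0–10 “how likely are you to recommend us?” question, and the percentage of detractors (0–6) is subtracted from the percentage of promoters (9–10). Here is a minimal sketch of that calculation; the sample responses are made up purely for illustration.

```python
def net_promoter_score(ratings):
    """Compute NPS from 0-10 'would you recommend us?' ratings.

    Promoters score 9-10, detractors 0-6; passives (7-8) are ignored.
    The result falls on the standard -100 to +100 scale.
    """
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return round(100 * (promoters - detractors) / len(ratings))

# Illustrative sample: 7 promoters, 2 passives, 1 detractor -> NPS of +60
print(net_promoter_score([10, 9, 9, 10, 9, 10, 9, 8, 7, 3]))
```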

We don’t think customer service is dead yet. At least not for our customers.

Is your data center ready for the coming zombie apocalypse?

Data center designers generally do a good job preparing for conventional risks like earthquakes, fires, floods, and hurricanes, but if your disaster recovery plan doesn’t include provisions for dealing with the undead, your risk mitigation strategy has a gaping hole. Data centers are a natural refuge from zombie hordes, but only if you prepare in advance.

Unlike conventional disaster recovery (DR)/business continuity planning (BCP), zombie preparedness has a unique set of goals beyond data protection and business resumption. RPO/RTO goals go out the window when there’s a geek chewing on your skull. RagingWire has developed a comprehensive zombie survival plan (ZSP) to ensure the long-term survivability of our facilities and our people.

Data Center Preparedness for Zombie Attack

ZSPs vary by company depending on their goals. RagingWire has identified 5 priorities that form the foundation of our zombie preparedness:

  • Containment – Keep the zombies out
  • Endurance – Stay alive until the zombies are gone
  • Sustenance – Don’t go hungry
  • Eradication – Kill every zombie you find
  • Repopulation – Breed new humans for the continuation of the race

Based on these goals, RagingWire redesigned its facilities from the ground up. To do this, we had to create a way of measuring the usefulness of various protections. The zombie protection effectiveness (ZPE) score is a composite average based on a weighted measure of our 5 goals. We use ZPE to prioritize changes we make to our infrastructure, processes, and people. While there are literally dozens of protections we’ve implemented, there are 11 key steps we took that had the greatest ROI and ZPE. Because we want to give back to the community, we are sharing these with the public.

  1. Multi-factor authentication on every door – Our iris scanners verify living tissue, so zombies can’t authenticate past them, but the more protection the better, because we’re certain they won’t remember their PIN codes. We installed iris, card, and PIN code readers on every door.
     
  2. Mantrap every hallway – If an infection breaks out inside our facility, it is critical to contain the zombies in specific zones. Also, we removed crash-bar overrides to the doors. The fire marshal has cited us for this, but after the apocalypse we won’t need permits anyway.
     
  3. Dig a well – Our data center needs a good supply of water. Zombie swarms have an annoying habit of knocking out local utility service so we planned ahead. We also put in filtration systems and above-ground storage. It’s important to have a clean supply when the bodies are piling up at your door.
     
  4. Expanded on-site fuel supply – Most data centers maintain a 24-48 hour supply of diesel with refill contracts. During a zombie attack, our fuel SLAs will likely not be met. We now have a 1,000-day supply. As a side note, it is theoretically possible to convert decaying zombie bodies into biofuel. This creates an interesting justification for adding Bloom Box fuel cell generators to your design. On the plus side, zombie biofuel likely generates less CO and CO2 waste than diesel, so you won’t have to worry so much about global warming.
     
  5. Roof-top gardens and animal pens – We’re going to need a food supply. It needs to be out of reach of the zombies. Our roof is the perfect place to start farming. Adding a garden to our roof also creates the opportunity to start using soil-side economizing as part of our normal cooling strategy, improving our PUE. Also, animal waste can be another source of biofuel for those Bloom fuel cells.
     
  6. Ramen. Lots and Lots of Ramen – We mentioned needing food, right? Well, our fellow data center survivors are mostly tech geeks like us, and tech geeks love Ramen. Ramen also keeps forever, so we don’t have to worry about updating our supply every 5 years. We try to maintain at least a 20-pallet supply in our warehouse, but for some reason, it keeps disappearing. We also keep a large supply of tactical bacon onsite. Ramen will keep us fed, but tacbac will keep us happy.
     
  7. Automated machine gun turrets with clear fields of fire – The best way to keep zombies out is to kill them efficiently as they approach the building. But we need to sleep, and we don’t want to go outside to reload. An automated, belt-fed machine gun system that can be reloaded from inside the building and can fire automatically with motion sensors is the best option.
     
  8. Security staff with crossbows and machetes – Research has taught us that the best weapons for close-quarters zombie battle are quiet and reusable. Crossbows never run out of ammo as long as you retrieve your bolts during lulls in the fighting. They also don’t make a lot of noise, unlike shotguns, so they tend not to attract other zombies to our location. And our security staff looks pretty cool with a crossbow on their back and a machete on their hip. Braveheart-blue face paint is of course optional but is highly encouraged.
     
  9. A complete copy of the Walking Dead on DVD – We’re going to need some forms of entertainment. It might as well be educational for our current predicament.
     
  10. Jefferies tubes, everywhere – If our mantraps are full of zombies, we will need other paths through our data center. A well-designed system of Jefferies tubes ensures that we don’t get cut off from food, fuel, and fellow survivors.
     
  11. Booze – Let’s face it, after a long day of administrating servers and shooting Zs in the head, you need a drink. Something to take the edge off. We have employed a multi-part strategy.
    • A healthy supply of Scotch, Vodka and Gin not only helps dull the senses when needed, but can also double as an antiseptic.
    • Beer must be kept cold. Under the raised-floor is a perfect place to stockpile a variety of our favorite microbrews.
    • Since repopulation is also a concern, we’ve included a supply of Smirnoff Ice, Zima and Jagermeister.
    • Corn and potatoes on the roof are dual use crops. We can distill them for spirits or fry them up for a late night snack.
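
As a methodology footnote: the ZPE score described earlier is just a weighted composite of the 5 goals. Here is a minimal sketch of how such a composite could be computed; the weights and ratings below are purely illustrative, not our actual numbers.

```python
# Illustrative weights for the 5 zombie-preparedness goals (not our real values).
GOAL_WEIGHTS = {
    "containment": 0.30,
    "endurance": 0.25,
    "sustenance": 0.20,
    "eradication": 0.15,
    "repopulation": 0.10,
}

def zpe_score(ratings):
    """Weighted composite of per-goal ratings (each 0-10), yielding a ZPE on a 0-10 scale."""
    return sum(GOAL_WEIGHTS[goal] * ratings[goal] for goal in GOAL_WEIGHTS)

# Example: scoring the machine gun turrets with made-up ratings.
turrets = {"containment": 9, "endurance": 6, "sustenance": 0,
           "eradication": 10, "repopulation": 2}
print(round(zpe_score(turrets), 2))  # 5.9 -- high enough to make the priority list
```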

We are constantly revising and extending our ZSP best practices guides. This is a community effort. If you create and test your own solutions, please share them in the comments section below. No one knows when the zombie attacks will occur, but we’ll stay online as a resource for as long as possible. If you would like to see these suggestions in action, please schedule a tour of our data center. We don’t recommend dropping by unannounced, because you never know when we’ll be testing the machine gun turrets. Good luck, and God bless.

When N+1 just isn’t good enough

2006 was a pivotal year for RagingWire. 2006 was the year RagingWire learned that for data centers, N+1 just isn't good enough. 2006 was the year RagingWire went dark. It started normally enough – a beautiful spring day in April. During normal operations, a 4,000-amp breaker failed. Material failures happen, even with the best maintenance programs in place. Our UPS units took the load while the generators started – then the generators overloaded. The data center went dark.

After bringing the data center back online, we performed a detailed post-mortem review and identified the root causes of the outage as design flaws and human error. Our management team declared that this could never, ever happen again. We knew that we needed to invest heavily in our people, and that we needed to rethink how data centers operate. We started by investing in our people, because human error can overwhelm even the best infrastructure designs. We focused our recruitment efforts on the nuclear energy industry and the Navy's nuclear engineering program – both working environments where downtime is not an option and process control, including operations and maintenance, is second nature. We hired a talented team and asked them to design and operate our data center to run like a nuclear sub.

Our revamped team of engineers determined that the then-current N+1 design did not meet their requirements, so they changed it and implemented the concept of a 2N+2 design. Their work was recognized last week when RagingWire announced the issuance of Patent #8,212,401 for “redundant isolation and bypass of critical power equipment.” This is one of two patents that resulted from RagingWire’s outage in 2006 and our efforts to design a system that would never go down again.

RagingWire’s systems are built to a 2N+2 standard. RagingWire exceeds the Uptime Tier IV standard by providing fault tolerance during maintenance. We call this “fix one, break one” or FOBO. This means that any active component – UPS, generator, chiller, pump, fan, switchboard, etc. – can be removed from service for maintenance, any other active component can fail, AND we can experience a utility outage, all without loss of power or cooling to the server rack. Having this extra level of redundancy allows RagingWire to perform more maintenance, and to do so without worrying about a loss in availability. This enables us to provide a 100% uptime SLA, even during maintenance windows.
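
To make the arithmetic behind FOBO concrete, here is a toy sketch (not RagingWire’s actual engineering model) that checks whether a given level of redundancy still covers the load when one component is pulled for maintenance and another one fails. The unit counts are hypothetical.

```python
def covers_load(required, installed, in_maintenance=0, failed=0):
    """Return True if the remaining units can still carry the required load."""
    return installed - in_maintenance - failed >= required

REQUIRED = 4  # hypothetical number of UPS modules needed to carry the full load

# N+1: one spare. A unit out for maintenance plus a failure exhausts the margin.
print(covers_load(REQUIRED, installed=REQUIRED + 1,
                  in_maintenance=1, failed=1))      # False -> outage risk

# 2N+2: a complete second set plus two spares rides through the same scenario.
print(covers_load(REQUIRED, installed=2 * REQUIRED + 2,
                  in_maintenance=1, failed=1))      # True
```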

Looking at the last year and a half, it’s clear that many data centers are still providing their customers an inferior N+1 design. How do you know? Simply look at the number of providers below who have suffered data center outages over the past 18 months. Since 2006, RagingWire has had 100% availability of its power and cooling infrastructure due to its superior 2N+2 design. If your current provider is still offering N+1, maybe it’s time to ask yourself if N+1 is still good enough for you.

October 22, 2012 – Amazon Web Services suffered an outage in one of its data centers that took down multiple customers in its US-East-1 region. The problem was attributed to a “small number” of storage volumes that were degraded or failed.

October 8, 2012 – A cable cut took down Alaska Airlines’ ticketing and reservation system, causing delays across the airline’s operations and preventing customers from checking in for flights.

August 7, 2012 – A fiber cut took nonprofit Wikipedia offline for an hour.

July 28, 2012 – Hosting.com powered off 1,100 customers due to human error during preventive maintenance on a UPS in their Newark, DE data center.

July 10, 2012 – Level3’s East London data center was offline for 5 hours after a UPS bus-bar failed.

July 10, 2012 – Salesforce.com suffered a worldwide outage after a power failure in one of Equinix’s Silicon Valley data centers.

June 29, 2012 – Amazon Web Services suffered a power outage in its Northern Virginia data center. Multiple generators failed to start automatically due to synchronization issues and had to be started manually.

June 14, 2012 – Amazon Web Services suffered a power outage in its Northern Virginia data center. The problem was blamed on a defective generator cooling fan and a misconfigured power breaker.

June 13, 2012 – US Airways had a nationwide disruption of their computer system, affecting reservations, check-in, and flight status, due to a power outage at their AT&T data center in Phoenix.

January 20, 2012 – A power failure in Equinix’s SV4 data center took several customers, including Zoho, offline.

October 10, 2011 – Research in Motion cut BlackBerry service to most of Europe for 6 hours due to a power failure in their Slough, UK data center. The outage caused service disruptions worldwide for 3 days.

August 11, 2011 – Colo4 in Dallas, TX suffered an automatic transfer switch failure, resulting in a 6 hour power outage.

August 7, 2011 – Amazon Web Services’ Dublin, Ireland data center lost power due to a generator phase-synchronization error, disrupting service to the EU West region.

Why Send Your Staff to the Data Center to Rack Just One Box?

It's 2 a.m. Do you know who is working on your servers?

This week RagingWire announced our new Unlimited Remote Hands and Eyes service, and I couldn’t be happier. Prior to joining RagingWire, I was a RagingWire customer for over 10 years, in addition to working with most of the other major Northern California data centers. One of my pet peeves was always the remote hands offerings. They stunk… I could never properly budget for them and I never received consistent service delivery. At most data centers, including RagingWire, remote hands was a time and materials (T&M) service: every time I called, the clock started ticking. At least RagingWire staffed skilled IT workers 24x7. Too often, the other data centers used security guards to provide their remote hands service.

I believe the combination of T&M billing and inconsistent service causes people to make bad decisions. A hard drive fails, and a decision is made to roll the dice and wait until someone can be sent to the data center to replace it. Or even worse, companies let the proximity of a facility become a primary criterion in their data center selection, because they know that occasionally someone needs to touch the equipment and they don’t trust the guys on the other end of the phone.

Help is here. RagingWire’s new service helps address these problems, and more. What we’ve done is create a service that allows for an unlimited number of Remote Hands and Eyes support requests for a fixed monthly fee – with a guaranteed response SLA. We fulfill this service with skilled technicians in our California and Virginia NOCs, which are staffed 24x7, not with security guards (I have nothing against security guards; however, they should be providing security services, not IT services). Because our service is unlimited, you don’t need to wait until the morning to get someone to move a cable, cycle a server, or swap a tape. Additionally, the clock doesn’t start ticking when you call – your fee is fixed, which makes it easy for budgeting and planning.

So what else is covered by this service? Visual equipment checks, loading media, and incremental changes such as adding a new server or switch – plus more. Why send your staff to the data center for half a day when all you need is one box racked? Ship it to us with instructions. We’ll rack it, cable it, document it, and let you know when it’s ready for use. This really is a great new service and it’s priced low enough to make the ROI compelling for customers of every size.

If you’d like more information, talk to your account rep. I hear we’re giving away a free iPad to one lucky customer who inquires about this service. That could be you!

Data Center Site Selection – Low Risk vs. Close Proximity

Location, Location, Location

When choosing a data center, location counts. Location is often the first criterion discussed when evaluating a new data center partner. But too often, the driver behind location is proximity. IT professionals like to be close to their servers. As much as we like to talk about operating a lights-out facility, there is always the comfort factor of being able to drive to the data center if there’s a problem.

Instead of proximity, risk should be the determining factor in choosing a data center: the risk of natural and manmade disasters that could affect the availability or performance of the systems you are housing. Intuitively we all understand this, but too often we blind ourselves to the risks that are right in front of us.

For example, most people would consider earthquakes to be the number one natural disaster risk in California. The risk of another Loma Prieta earthquake (6.9 on the Richter scale) occurring in the Bay Area in the next 50 years is between 50% and 80%, depending on which city you are looking at. Loma Prieta took down part of the Bay Bridge and the freeway system, and was a huge deal back in 1989. But compare a USGS map of earthquake probabilities against a map of data centers in the Bay Area.


Most of the data center space in the Bay Area is in the worst place possible from an earthquake perspective. Not only is the probability of another large quake high, but the land the data centers sit on is subject to liquefaction. Basically, the sand underneath the buildings acts like a fluid during a quake. Even if the data center itself survives, will the surrounding electrical grid, water supply, roads, fuel vendors, etc. continue to function?

Earthquakes are just one of the many risks that data centers must deal with. When choosing a data center, looking at all the risks associated with the facility is key. The Uptime Institute has published a helpful guide to the natural disaster risk profiles for data center locations. A detailed discussion of the variety of risk factors data centers face is available at datacenterlinks.com.

RagingWire’s facilities in Sacramento, California and Ashburn, Virginia are both in locations with low composite risk scores according to the Uptime Institute. For example, while the probability of a Loma Prieta-sized quake hitting Bay Area data centers is higher than 50%, the probability of a quake of that size hitting RagingWire in Sacramento is less than 0.03%.

Risk, not proximity, should be the driving factor in your data center location criteria. RagingWire has a variety of innovative services, like our unlimited remote hands service, that can lessen the need for proximity to be a determining factor. RagingWire lets you choose the best data center, not just the best data center within driving distance of your office.
