Cloudburst: Hard lessons learned from the OVH datacenter blaze

In every tabletop disaster-recovery exercise in every enterprise IT shop, there’s a moment when attention grudgingly shifts from high-profile threats — malicious intrusion, data theft, ransomware — to more mundane (and seemingly less likely) threats, like natural disasters, accidents, and low-tech turmoil.

What hurricanes, explosions, earthquakes, fires, and floods lack in cybersecurity panache, they often make up for in ferocity. The history is clear: CIOs need to put more emphasis on force majeure — an act of God or moment of mayhem that threatens data availability at scale — when making their plans.

On Christmas Day 2020, a bomb packed into an RV decimated a section of downtown Nashville, Tennessee. The collateral damage included a crippled AT&T transmission facility, which disrupted communications and network traffic across three states and grounded flights at Nashville International Airport. Outages for business clients and their customers lasted through the rest of the holiday season.

This week brought even starker evidence of the disruptive power of calamity. One of Europe’s largest cloud hosting firms, OVH Groupe SAS, better known as OVHcloud, suffered a catastrophic fire at its facility in Strasbourg, France. The blaze in a cluster of boxy, nondescript structures (actually stacks of shipping containers repurposed to save on construction costs) completely destroyed one of OVH’s four datacenters at the site and heavily damaged another.

OVH officials were quick to sound the alarm, with founder and chair Octave Klaba warning that it could take weeks for the firm to fully recover and urging clients to implement their own data recovery plans.

Assuming they had them. Many did not.

Scarcely protected data remains a significant problem for businesses of all stripes and sizes. In 2018, Riverbank IT Management in the U.K. found that 46% of SMEs (small and mid-size enterprises) had no plan in place for backup and recovery. Most companies (95%) failed to account for all of their data, on-premises and in the cloud, in whatever backup plans they did have.

The results of such indiscretion are costly. According to Gartner, data-driven downtime costs the average company $5,600 every minute, which works out to more than $300,000 per hour. The destruction at the OVH facility on the banks of the Rhine near the German border took down 3.6 million websites, from government agencies to financial institutions to computer gaming companies, many of which remain dark as of this writing. Affected customers complained on blogs and social media that years’ worth of data was lost for good in the OVH conflagration. The final financial tally will be staggering.

“Not all data catastrophes are caused by a hoodie-wearing, Eastern European hacker,” said Kenneth R. van Wyk, president and principal consultant at KRvW Associates, a security consultancy and training company in Alexandria, Virginia. “Some are caused by the most mundane circumstances.”

“Sure, we need to consider modern security threats like ransomware, [but] let’s never forget the power of a backhoe ripping through a fiber optic line feeding a business-critical datacenter.”

“It’s about a mindset of always expecting the worst,” van Wyk said. “Security professionals look at systems and immediately ask ‘What could go wrong?’ Every business owner should do the same.”

In this age of ubiquitous cloud migration and digital transformation, what can IT leadership do to gird the organization against hazards large and small? The answer lies within the realm of business continuity and disaster recovery (BCDR). This well-codified discipline in information security is a critical, but often missing, piece in enterprise risk management and mitigation. Most organizations understand the basic rules of engagement when it comes to BCDR, but security experts agree that execution often lacks rigor and commitment.

“As a CIO, I’d immediately ask, ‘Have we truly tested our backups and recovery capability?’” said cloud security specialist Dave Shackleford, founder and principal consultant at Voodoo Security in Roswell, Georgia. “Whether cloud-based or not, too many organizations turn disaster recovery and business continuity planning and testing into ‘paper exercises’ without really ensuring they’re effective.”

For organizations looking to protect key digital assets, what Shackleford deems an effective BCDR approach begins with a few time-tested best practices.

Start with the provider

Ask about redundancy and geographic resilience — and get it in writing. Losing two cloud datacenters will always result in disruption and downtime, even for a host like OVH with 300,000 servers in 14 facilities across Europe and 27 worldwide. But how painful and protracted that loss is will largely depend on the robustness of the hosting company’s own backup and fail-over protocols.

The assurances, as spelled out in the service-level agreement (SLA), must also go beyond data processing and storage. A big part of Roubaix-based OVH’s troubles stemmed from the failure of backup power supplies that damaged its own custom-built servers — even in areas unaffected by the actual fire.

Look for items in the SLA that address not only the service guarantee but also the eligibility for compensation and level of compensation offered. Offering “five-nines” availability is great, but the host should also demonstrate a commitment to diverse transit connections; multiple sources of power; redundant networking devices; and multiple, discrete storage assets on the backend.
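To put a number on what “five-nines” actually promises, it helps to convert the percentage into the downtime it permits. The short Python sketch below does that arithmetic; the function name and the comparison tiers are illustrative assumptions, not terms taken from any provider’s SLA.

```python
# Hypothetical helper: converts an SLA availability percentage into the
# maximum downtime that target still permits over a period.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_pct: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the minutes of downtime an availability target tolerates."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period_minutes * (1 - availability_pct / 100)


# Compare a few common availability tiers (chosen here for illustration).
for label, pct in [("three nines", 99.9),
                   ("four nines", 99.99),
                   ("five nines", 99.999)]:
    minutes = allowed_downtime_minutes(pct)
    print(f"{label} ({pct}%): {minutes:.2f} minutes of downtime per year")
```

Five nines leaves room for a little over five minutes of downtime a year, which is why the compensation terms and the redundancy behind the guarantee matter as much as the headline figure.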

Get your own house in order

Holding your cloud host accountable is a solid start, but it’s important to remember that, as the OVH experience casts in stark relief, enterprise-grade cloud is not some mythical realm of infinite resources and eternal uptime. Moving important digital assets to the cloud means swapping your own infrastructure for that of another, for-profit vendor partner.

The first requirement for cloud migration is a framework for determining whether the move is wise and effective in the first place. Then there needs to be a comprehensive plan in place to protect everything the organization holds dear.

“Inventory all your critical assets,” van Wyk suggests. “Ask how much it would cost you if any of them were unavailable, for any reason, for an hour, a day, a week. Ask how you would restore your business if everything in your inventory vaporized. What would the downtime be? Can you afford that? What is your Plan B?”
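Van Wyk’s questions translate naturally into a simple inventory table. The Python sketch below is one way to start; the asset names, hourly cost figures, and the has_offsite_backup flag are made-up placeholders for illustration, not data from the article.

```python
from dataclasses import dataclass

# The outage durations van Wyk suggests asking about, expressed in hours.
OUTAGE_HOURS = {"hour": 1, "day": 24, "week": 24 * 7}


@dataclass
class Asset:
    name: str
    cost_per_hour: float       # estimated loss per hour offline (placeholder value)
    has_offsite_backup: bool   # is there a "Plan B" outside the primary site?


def downtime_exposure(asset: Asset) -> dict:
    """Estimate the cost of losing an asset for an hour, a day, and a week."""
    return {label: asset.cost_per_hour * hours
            for label, hours in OUTAGE_HOURS.items()}


# Hypothetical inventory; a real one would come from the business itself.
inventory = [
    Asset("customer database", 4_000.0, True),
    Asset("order-processing API", 2_500.0, False),
]

for asset in inventory:
    exposure = downtime_exposure(asset)
    plan_b = "yes" if asset.has_offsite_backup else "NO PLAN B"
    print(f"{asset.name}: hour=${exposure['hour']:,.0f}, "
          f"day=${exposure['day']:,.0f}, week=${exposure['week']:,.0f}, "
          f"offsite backup: {plan_b}")
```

A linear hourly cost is a simplification, since real losses often grow faster the longer an outage lasts, but even a rough table like this makes it obvious which assets lack a fallback.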

The Cloud Security Alliance offers excellent guidance when preparing, analyzing, and justifying cloud projects with an eye toward risk, particularly with its Cloud Controls Matrix (CCM).

If third-party hosting is warranted, it should be guided by formal policy that covers issues such as:

  • Definitions for systems, data types, and classification tiers that can be accounted for in a risk assessment
  • Graduated internal policies and standards attached to each classification tier
  • Application and security requirements
  • Specific compliance/regulatory requirements
  • And a BCDR plan that covers all assets entrusted to all third-party providers

Create fireproof backup

Understand that failures are going to happen. Backup and recovery is so fundamental to the security triad of data confidentiality, integrity, and availability (CIA) that it enjoys its own function, Recover, in the NIST Cybersecurity Framework. NIST’s CSF encourages organizations to ensure that “recovery processes and procedures are executed and maintained to ensure timely restoration of systems or assets affected by cybersecurity incidents.”

There’s a lot going on in that sentence, to be sure.

Developing a robust approach to recovery that can satisfy NIST and withstand a catastrophic event like the OVH fire takes more than scheduling some automated backups and hoping for the best.

Van Wyk said it’s a good idea to take extra precautions with your vital business data and processing and ensure you will actually be able to use your backup plans in different emergency scenarios.

Whether organizations’ crown jewels live on-premises, in a hybrid environment, or solely in the cloud, a mature and pragmatic BCDR approach should include:

  1. Making it formal. A real, effective disaster-recovery plan must be documented. Putting the plan in writing, including the who, what, where, when, and how of it all, helps organizations quantify required actions for preventing, detecting, reacting to, and solving data-loss events.
  2. Quantifying data at risk. Formal BCDR documentation is the best place to ensconce a detailed data-classification schema and a backup-specific risk register, to include a realistic rundown of threats facing the organization, the consequences of lost data of various types, and a menu of mitigations.
  3. Drafting some all-stars. A mature BCDR approach requires more than policies and processes; it demands a dedicated group of stakeholders responsible for various parts of the plan. A well-rounded disaster-recovery team should include people from diverse areas of the business who can assess the damage, kick-start recovery plans, and help keep disaster-recovery plans updated. These are the folks who know what to do when trouble strikes.
  4. Counting on communications. A significant part of the NIST guidance on recovery demands that “restoration activities are coordinated with internal and external parties, such as coordinating centers, internet service providers, owners of attacking systems, victims, and vendors.” This requires thoughtful, advance planning to ensure communications remain open to employees, customers, law enforcement, emergency personnel, and even the media. The heat of the moment is no time to be scrambling for contact info.
  5. Testing for efficacy. Formal incident recovery exercises and tests at regular intervals are critical to BCDR success, as many OVH customers discovered to their horror. Crunch time is not the time to figure out if backups can successfully be put into production in a reasonable period. Sensible practice runs should include realistic objectives, with specific roles and responsibilities, for stress-testing the organization’s recovery capabilities.
  6. Keeping it fresh. BCDR plans should be reviewed annually to ensure they remain relevant and practical. Moreover, every trial run, every exercise, and every data-loss incident, no matter how small, is an excellent opportunity to examine lessons learned and make pragmatic improvements.
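The testing step in particular lends itself to automation. As a minimal sketch of what a restore drill might check, the Python below compares a restored directory against the original, file by file, using SHA-256 checksums. The function names and the sample files are hypothetical, and a real drill would more likely compare against a checksum manifest recorded at backup time, since the original may no longer exist after a disaster.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path
from typing import List


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def verify_restore(original: Path, restored: Path) -> List[str]:
    """List files that are missing or differ in the restored copy."""
    problems = []
    for src in sorted(p for p in original.rglob("*") if p.is_file()):
        rel = src.relative_to(original)
        dst = restored / rel
        if not dst.is_file():
            problems.append(f"missing: {rel.as_posix()}")
        elif file_digest(src) != file_digest(dst):
            problems.append(f"corrupt: {rel.as_posix()}")
    return problems


# Demo with throwaway files: simulate a restore that lost one file
# and damaged another.
with tempfile.TemporaryDirectory() as tmp:
    live = Path(tmp) / "live"
    restored = Path(tmp) / "restored"
    live.mkdir()
    (live / "orders.csv").write_text("id,total\n1,9.99\n")
    (live / "customers.csv").write_text("id,name\n1,Ada\n")
    shutil.copytree(live, restored)
    (restored / "customers.csv").write_text("id,name\n")  # truncated file
    (restored / "orders.csv").unlink()                    # lost file
    for problem in verify_restore(live, restored):
        print(problem)
```

An empty result means every file came back intact. Anything else is a finding to feed into the next plan review.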

No BCDR plan can ward off all chaos and guarantee perfect protection. But as the OVH incident demonstrates, half-hearted policies and incomplete protocols are about as effective as no plan at all. Establishing a solid BCDR posture requires meaningful investment in resources, time, and capital. The payoff comes when the lights flicker back on and rebooted systems go back online, data intact and none the worse for the experience.
