Virtual Failover in the Cloud: Challenges Abound

Written By

Christine Taylor

Mar 26, 2015

Failover to a remote location is a mature technology. So is cloud storage. But when users want to fail over their virtual environments to the cloud, they can face distinct challenges.

Although both processes use replication, cloud failover is much more than replicating backups to the cloud for later recovery. The failover process uses the cloud as a secondary DR site. Standby servers take over processing from a failed VM environment so that applications keep running without interruption, then fail back to the primary data center when the event is resolved. Failover to the cloud may be automated or manual; both have advantages and disadvantages.

Let’s define some specifics. We’re talking about virtual-to-virtual here. It’s technically possible to fail over on-premises physical servers to physical servers in the cloud using bare metal recovery (BMR) technology. But it’s impractical. Few (if any) cloud DR vendors support it because their services are built on virtual server technology. VM architecture allows users to avoid the issue of maintaining identical hardware in the secondary data center, which is a huge part of the cloud-based DR value proposition.

We’ll also discuss failover in the context of public clouds. Although failover is certainly possible in company-owned private clouds, it defeats the purpose of simple scalability that the public cloud offers.

What You Need to Know

Why is failover to remote sites a mature technology while failing over to the cloud is not? The cloud itself is the difference.

The cloud is undeniably attractive for its scalability and economics, and once the failover site is built and tested it can be relatively simple to maintain. With virtual failover you do not need to maintain nearly identical hardware as you must at a remote site, and you gain near-infinite scalability. However, there are also real challenges in failing over production data to the cloud.

Maintaining Service Levels

Backup data alone is pretty low-risk. Public cloud reliability is very high, and availability is high and improving thanks to distributed operations. But when it comes to critical business applications, the risks of cloud storage scale up from those of backup and recovery (BUR). Because data moves sluggishly over the Internet, remote failover to the cloud for virtualized production storage with an acceptable recovery time objective (RTO) and recovery point objective (RPO) is fairly new.

Backing up server images to the cloud is pretty simple if you have the necessary bandwidth. But running those applications in the cloud in a failover scenario is a different kettle of fish. To begin with, you will need separate failover domains for VMware and Hyper-V. You might also need separately configured domains for specific applications in order to provide proper service levels for failed-over applications.

Test your applications before trusting them to the cloud DR site. Amazon, Google, Azure and other large public clouds are capable of offering the performance you need (at a price), but you will need to test your bandwidth and configurations.

Invest in Bandwidth

Bandwidth plays a critical part in using the cloud as a DR site. Virtualized data centers produce large snapshots and a lot of them. Efficiently managing your snapshots is key to efficiently managing a failover DR site in the cloud, especially if you are looking at a cloud gateway product to accelerate data transport times. Gateways can work very well in lower-traffic environments but can become a bottleneck in high-volume replication environments.
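A rough way to size the pipe is to compare how long one snapshot takes to cross the WAN with how often snapshots are taken. The sketch below is illustrative only: the function names are made up for this example, and the snapshot size, link speed, and the 70 percent usable-throughput figure are assumptions, not measurements from any real environment.

```python
def transfer_seconds(snapshot_gb: float, link_mbps: float, efficiency: float = 0.7) -> float:
    """Estimate the seconds needed to push one snapshot over a WAN link.

    efficiency is an assumed fraction of the nominal link rate that is
    actually usable after protocol overhead and competing traffic.
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be positive and efficiency in (0, 1]")
    bits = snapshot_gb * 8 * 1000**3              # decimal gigabytes to bits
    usable_bps = link_mbps * 1000**2 * efficiency  # megabits per second to bits per second
    return bits / usable_bps


def keeps_up(snapshot_gb: float, link_mbps: float, interval_minutes: float,
             efficiency: float = 0.7) -> bool:
    """True if one snapshot finishes transferring before the next one is due."""
    return transfer_seconds(snapshot_gb, link_mbps, efficiency) <= interval_minutes * 60
```

Under these assumptions, a 20 GB snapshot on a 100 Mbps link takes roughly 38 minutes, so it cannot keep up with a 15-minute snapshot interval, while the same snapshot on a 1 Gbps link takes under 4 minutes. Measure your own effective throughput before relying on numbers like these.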

Whether you use a cloud gateway or not, replicate only delta-level changes and apply deduplication and compression. You should also avoid continuous snapshot replication if your service levels allow. Continuous or near-continuous snapshot replication is a drain on LAN resources, not to mention Internet pipes. In any case, effective snapshot algorithms are a must-have for successful cloud-based failover.
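To make the idea of delta replication with dedupe and compression concrete, here is a minimal sketch. It is not how any particular product works: the fixed 4 KB block size, the function names, and the in-memory dictionary standing in for the cloud-side block store are all assumptions chosen for illustration.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # assumed fixed block size; real products vary


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut a snapshot into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def build_delta(snapshot: bytes, known_hashes: set[str]) -> tuple[list[str], dict[str, bytes]]:
    """Return (manifest, payload) for one snapshot.

    manifest: ordered block hashes describing the full snapshot.
    payload:  compressed blocks the remote site does not hold yet.
    """
    manifest: list[str] = []
    payload: dict[str, bytes] = {}
    for block in split_blocks(snapshot):
        digest = hashlib.sha256(block).hexdigest()
        manifest.append(digest)
        # Dedupe: skip blocks the remote already has or that are already queued.
        if digest not in known_hashes and digest not in payload:
            payload[digest] = zlib.compress(block)
    return manifest, payload


def apply_delta(manifest: list[str], payload: dict[str, bytes],
                store: dict[str, bytes]) -> bytes:
    """Remote side: store the new blocks, then rebuild the snapshot from the manifest."""
    for digest, compressed in payload.items():
        store[digest] = zlib.decompress(compressed)
    return b"".join(store[digest] for digest in manifest)
```

In use, the first call to `build_delta` with an empty set of known hashes ships every unique block once. A later snapshot that changes a single block ships only that block, and the manifest lets the remote side reassemble the full image from what it already holds.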

Security and Availability

Another challenge is security. Securing backup and archive data in the cloud is important; securing and accessing production data is a lot more so. You need both reliability and availability: reliability in that your cloud provider isn’t going to lose your data; availability in that you can access your data when you need to. Work out your service levels with your provider. You’ll be paying more than you would for simple backup and recovery, but you don’t want to mess with success when it comes to applications.

Do your due diligence on encryption levels and make encryption decisions for data-at-rest (which you probably need) and data-in-flight (which you may or may not need). Also watch out for multi-tenant issues. The public cloud is a massively scaled multi-tenant environment. One risk is performance degradation if other tenants unexpectedly consume massive resources. The last thing you want is someone else’s surprise consumption grabbing your resources just as your applications launch from your cloud DR site. Understand how your public cloud provider and your DR vendor protect you from other tenants and from system failures.

Another potential issue is with automated failover. Automating DR, while in general a best practice for critical DR, is not a magic bullet because of the so-called split-brain event. This occurs when an error at the VM level triggers automated failover even though the VM was not in fact in a failure state, so the primary and failover sites can both end up acting as the production copy. As of 2015, automated failover to the cloud is better at monitoring paths and events, but false triggers are still an issue to be aware of. In many cases, an immediate alert to an IT team when a VM fails might be a better solution than automation alone.
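One common way to reduce false triggers is to require several independent health probes to agree before failing over, and to alert a human otherwise. The sketch below illustrates that decision logic only. The probe names, the `quorum` parameter, and the `Action` values are hypothetical and do not describe any vendor’s product.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"          # everything looks healthy
    ALERT = "alert"        # notify IT, do not fail over automatically
    FAILOVER = "failover"  # enough probes agree that the VM is down


def decide(probe_results: dict[str, bool], quorum: int,
           auto_failover: bool = True) -> Action:
    """Choose an action from independent health probes.

    probe_results maps a probe name to True when that probe sees the VM as healthy.
    quorum is the number of failing probes required before automated failover.
    """
    failures = sum(1 for healthy in probe_results.values() if not healthy)
    if failures == 0:
        return Action.NONE
    if auto_failover and failures >= quorum:
        return Action.FAILOVER
    # A single failing probe, or automation disabled: leave the call to people.
    return Action.ALERT
```

With a quorum of 2, a lone failing hypervisor heartbeat produces an alert rather than a failover, which is the behavior the paragraph above argues for. Setting `auto_failover=False` turns the same logic into an alert-only policy.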

The Dynamic Cloud

The cloud is a dynamic environment, yet successful failover depends on users being able to find the ported application and its data. One vendor development choice is to use cloud-based clusters as the failover DR site.

MS Windows Server uses the clustering method as a proven DR technology between on-premise and remote sites. However, Windows-based clustering needs access to Active Directory. This means that IT will need to extend AD to the cloud, which requires ongoing synchronization between the network and cloud AD versions.

The more common development technique is replicating VMs and their data to the cloud so that users are transparently redirected to the cloud should the on-premise environment fail. The drawback to this architecture is resolving IP address and DNS record changes to accommodate the changed production site.

These days most service providers and vendors propagate changes for you or provide tools to do so more easily. For example, Amazon Route 53’s DNS web service automates both types of changes for developers and users, making it easier to perform failover processes within the cloud. Another way to solve the addressing issues is to use newer vendors that built their cloud-based DR offerings from the ground up. Zadara, with its Virtual Private Storage Array (VPSA), uses the public cloud to provide enterprise-level DR services on AWS and other cloud providers, and automates dynamic address changes.
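As an illustration of what a DNS repoint looks like, the sketch below builds a change request in the shape that Route 53’s ChangeResourceRecordSets API accepts. The helper name, the hostname, the address (taken from a documentation-only range), and the 60-second TTL are all assumptions for this example. It only constructs the request and sends nothing.

```python
import ipaddress


def build_failover_change(record_name: str, cloud_ip: str, ttl: int = 60) -> dict:
    """Build a change batch that points an A record at the cloud DR site.

    A short TTL (an assumed 60 seconds here) limits how long clients keep
    resolving the name to the failed on-premises address.
    """
    ipaddress.IPv4Address(cloud_ip)  # raises ValueError on a malformed address
    return {
        "Comment": f"Fail over {record_name} to cloud DR site",
        "Changes": [
            {
                "Action": "UPSERT",  # create the record or overwrite the existing one
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": cloud_ip}],
                },
            }
        ],
    }


# Hypothetical values for illustration only.
batch = build_failover_change("app.example.com.", "203.0.113.10")
```

With the boto3 library, a dictionary like this would be passed as the `ChangeBatch` argument to the Route 53 client’s `change_resource_record_sets` call, along with the `HostedZoneId` of your own zone. Failing back is the same request with the on-premises address.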

Why Bother? Because It’s Worth It

When you get the setup and service levels right, virtual failover to the cloud is an excellent DR option. Even with the complexity of initial setup and testing, it’s easier than leasing a remote site and physically building a second data center, not to mention the hassle and risk of keeping hardware and software essentially identical. Instead you’ll be replicating to a highly flexible and dynamically scaled environment, which is not a small consideration for anyone who has tried to keep two data centers in lockstep.

You’ll probably want to invest in higher bandwidth, or at the least in products that give you bandwidth optimization techniques. Ideally you will invest in both. However, once you have made the additional investment, ongoing costs can be quite reasonable. In addition to avoiding the expense of creating and maintaining the secondary data center, you do not have to pay for staff at the secondary data center. And you can free up existing IT staff to work on other high-value projects.

Management may be similar to what you are used to. If you are already using VMware or Hyper-V tools to replicate to a secondary data center, you can use the same tools to replicate to the cloud. The same is true of third-party products, since they preserve as much as possible of the familiar hypervisor consoles and toolsets.

Hyper-V, for example, uses Azure-centric Hyper-V Replica with Azure Site Recovery Manager to replicate and fail over VMs in Virtual Machine Manager (VMM) clouds within Azure. Hyper-V Recovery Manager (HRM) automates more of this process. VMware offers Site Recovery Manager (SRM); its newer public cloud recovery option is VMware vCloud Air Disaster Recovery. Unlike SRM, vCloud Air DR provides native cloud-based DR for VMware vSphere. It is built on vSphere Replication’s asynchronous replication and failover.

Not Just for DR

Drivers for cloud-based failover vary. DR is the biggest driver but data migration, test/dev and additional processes also benefit.

· VM migration. Failover also works for planned events like VM migration. A Nutanix user reported that they used Nutanix Cloud Connect as a failover site for virtualized web app migrations. Nutanix manages BUR, DR and test/dev in the public cloud using Nutanix Prism and Cloud Connect. The cloud-based Controller VM (CVM) cluster operates exactly like a remote cluster. Data moves from the on-premises cluster to the cloud accordingly.

A few days in advance of the planned migration, the user transferred all affected applications and data to the cloud by manually shutting down the VMs, waiting for the automated failover to complete, then activating the cloud cluster. They then restored the applications and data to the new environment when they were ready.

· DR tests. DR tests are traditionally awkward, unrealistic, and time-consuming, which is why companies rarely test their DR plans. With failover in the cloud, IT can easily test failover procedures and recovery times without committing to an identical remote data center. Zerto Virtual Replication is a hypervisor-based replication product that supports large-scale DR and testing in the cloud as well as automated failover and failback. Unitrends Reliable DR manages and automates application-specific testing for multi-VM applications and guarantees failover in virtualized production environments.

· Bare Metal Recovery (BMR). Virtualization in the cloud can also aid in bare metal recovery (BMR). BMR is the process of restoring an identical system after a failure, from the OS and drivers through applications and production data. Physical BMR requires an identical hardware environment for error-free restores; otherwise you’re going to see serious errors. In virtual environments, vendors like Zetta.net can recover a VM image to spin up bare metal. This makes for a much more efficient and less error-prone BMR procedure.
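The DR testing item above mentions checking recovery times. A test run can be scored against your RTO with very little code. The sketch below is a hypothetical harness, not part of any product named here: `failover_step` stands in for whatever procedure actually brings your applications up in the cloud, and the clock is injectable so the logic can be exercised without a real failover.

```python
import time
from typing import Callable


def timed_dr_test(failover_step: Callable[[], None], rto_seconds: float,
                  clock: Callable[[], float] = time.monotonic) -> dict:
    """Run one failover procedure and report whether it finished within the RTO."""
    start = clock()
    failover_step()  # hypothetical: your own scripted failover procedure
    elapsed = clock() - start
    return {
        "elapsed_seconds": elapsed,
        "rto_seconds": rto_seconds,
        "met_rto": elapsed <= rto_seconds,
    }
```

Recording the result of each test over time gives you evidence of whether recovery times are drifting as the environment grows.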

Given all of its attendant issues, is cloud-based failover worth researching and investing in? For many companies, yes, but not for all. If you have a remote DR setup that is working for you, there is no need to abandon it. This is certainly the case if your company owns multiple data centers and you have replication and DR set up between them.

However, even then IT might consider testing cloud-based DR for a pilot project in a virtualized server environment. Virtual networks are growing very fast and they throw off a lot of data. The scalability of the cloud offers real advantages in these specific environments.
