Failover to a remote location is a mature technology. So is cloud storage. But when users want to fail over their virtual environments to the cloud, they can face distinct challenges.
Although both processes use replication, cloud failover is much more than replicating backups to the cloud for later recovery. The failover process uses the cloud as a secondary DR site. Standby servers take over the processing of a failed VM environment for uninterrupted application performance, then fail back to the primary data center when the event is resolved. Failover to the cloud may be automated or manual; both have advantages and disadvantages.
Let’s define some specifics. We’re talking about virtual-to-virtual here. It’s technically possible to fail over on-premises physical servers to physical servers in the cloud using bare metal recovery (BMR) technology. But it’s impractical. Few (if any) cloud DR vendors support it because their services are based on virtual server technology. VM architecture allows users to avoid the issue of maintaining identical hardware in the secondary data center, which is a huge part of the cloud-based DR value proposition.
We’ll also discuss failover in the context of public clouds. Although failover is certainly possible in company-owned private clouds, doing so defeats the purpose of the simple scalability that the public cloud offers.
What You Need to Know
Why is failover to remote sites a mature technology while failing over to the cloud is not? The cloud itself is the difference.
The cloud is undeniably attractive for its scalability and economics, and once the failover site is tested and complete it can be relatively simple to maintain. With virtual failover you do not need to maintain nearly identical hardware as you must in a remote site, and you gain near-infinite scalability. However, there are also real challenges in failing over production data to the cloud.
Maintaining Service Levels
Backup data alone is pretty low-risk. Public cloud reliability is very high, and availability is high and improving thanks to distributed operations. But when it comes to critical business applications, the risks of cloud storage are greater than they are for backup and recovery (BUR). Because data moves slowly over the Internet, remote failover to the cloud for virtualized production storage with an acceptable recovery time objective (RTO) and recovery point objective (RPO) is fairly new.
Backing up server images to the cloud is pretty simple if you have the necessary bandwidth. But running those applications in the cloud in a failover scenario is a different kettle of fish. To begin with, you will need separate failover domains for VMware and Hyper-V. You might also need separately configured domains for specific applications in order to provide proper service levels for failed-over applications.
Test your applications before trusting them to the cloud DR site. Amazon, Google, Azure and other large public clouds are capable of offering the performance you need (at a price), but you will need to test your bandwidth and configurations.
Invest in Bandwidth
Bandwidth plays a critical part in using the cloud as a DR site. Virtualized data centers produce large snapshots and a lot of them. Efficiently managing your snapshots is key to efficiently managing a failover DR site in the cloud, especially if you are looking at a cloud gateway product to accelerate data transport times. Gateways can work very well in lower-traffic environments but can become a bottleneck in high-volume replication environments.
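As a rough way to size the pipe, the sketch below estimates how long a batch of changed snapshot data would take to cross a WAN link. The function name, the 80 percent link-efficiency default and the data-reduction ratio are illustrative assumptions, not figures from any vendor or provider.

```python
def replication_hours(changed_gb: float,
                      link_mbps: float,
                      reduction_ratio: float = 1.0,
                      link_efficiency: float = 0.8) -> float:
    """Estimate the hours needed to ship changed snapshot data to the cloud.

    changed_gb      -- changed data to replicate, in decimal gigabytes
    link_mbps       -- nominal link speed, in megabits per second
    reduction_ratio -- assumed dedupe/compression ratio (4.0 means 4:1)
    link_efficiency -- assumed usable fraction of the nominal link speed
    """
    if changed_gb < 0 or link_mbps <= 0 or reduction_ratio <= 0:
        raise ValueError("sizes, speeds and ratios must be positive")
    if not 0 < link_efficiency <= 1:
        raise ValueError("link_efficiency must be in (0, 1]")

    bits_to_send = changed_gb * 8e9 / reduction_ratio   # GB -> bits, after reduction
    usable_bps = link_mbps * 1e6 * link_efficiency      # Mbps -> usable bits/second
    return bits_to_send / usable_bps / 3600             # seconds -> hours


if __name__ == "__main__":
    # 500 GB of changes over a 100 Mbps link, with and without 4:1 reduction
    print(round(replication_hours(500, 100), 1), "hours without data reduction")
    print(round(replication_hours(500, 100, reduction_ratio=4.0), 1), "hours at 4:1")
```

Under these assumptions, 500 GB of changed data over a 100 Mbps link running at full efficiency works out to roughly 11 hours, which would not fit an eight-hour overnight replication window. Measure your real change rate and throughput before relying on numbers like these.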
Whether you use a cloud gateway or not, replicate only delta-level changes and practice dedupe and compression. Also avoid continuous snapshot replication if your service levels allow. Continuous or near-continuous snapshot replication is a drain on LAN resources, not to mention on Internet pipes. In any case, effective snapshot algorithms are a must-have for successful cloud-based failover.
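To make the idea of delta-level replication with dedupe and compression concrete, here is a minimal sketch using only the Python standard library. It is not how any particular gateway or DR product works: the fixed 4 KB block size, the function names and the in-memory data layout are all assumptions chosen for illustration.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # assumed fixed block size; real products vary


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Return one SHA-256 digest per fixed-size block of a snapshot."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]


def build_delta(previous_hashes: list[str], current: bytes,
                block_size: int = BLOCK_SIZE):
    """Find blocks that differ from the previous snapshot.

    Returns (changed, payloads):
      changed  -- {block index: digest} for every new or modified block
      payloads -- {digest: compressed block}, stored once per unique block,
                  so identical changed blocks are sent only once (dedupe)
    """
    changed: dict[int, str] = {}
    payloads: dict[str, bytes] = {}
    for index, offset in enumerate(range(0, len(current), block_size)):
        block = current[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if index < len(previous_hashes) and previous_hashes[index] == digest:
            continue  # unchanged block: nothing to replicate
        changed[index] = digest
        if digest not in payloads:
            payloads[digest] = zlib.compress(block)
    return changed, payloads


if __name__ == "__main__":
    old = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"C" * BLOCK_SIZE
    new = b"A" * BLOCK_SIZE + b"X" * BLOCK_SIZE + b"X" * BLOCK_SIZE
    changed, payloads = build_delta(block_hashes(old), new)
    sent = sum(len(p) for p in payloads.values())
    print(f"{len(changed)} changed blocks, {len(payloads)} unique payloads, "
          f"{sent} compressed bytes instead of {len(new)}")
```

In this toy example only the two modified blocks are considered, and because they are identical a single compressed payload covers both. Real snapshot engines track changed blocks at the hypervisor or storage layer rather than rehashing whole images.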
Security and Availability
Another challenge is security. Securing backup and archive data in the cloud is important; securing and accessing production data is a lot more so. You need both reliability and availability: reliability in that your cloud provider isn’t going to lose your data; availability in that you can access your data when you need to. Work out your service levels with your provider. You’ll be paying more than you would for simple backup and recovery, but you don’t want to mess with success when it comes to applications.
Do your due diligence on encryption levels and make encryption decisions for data-at-rest (which you probably need) and data-in-flight (which you may or may not need). Also watch out for multi-tenant issues. The public cloud is a massively scaled multi-tenant environment. One risk is performance degradation if other tenants unexpectedly consume massive resources. The last thing you want is someone else’s surprise consumption grabbing your resources just as your applications launch from your cloud DR site. Understand how your public cloud provider and your DR vendor protect you from other tenants and from system failures.
Another potential issue is with automated failover. Automating DR, while in general a best practice for critical DR, is not a magic bullet because of the so-called split-brain event. This occurs when an error at the VM level triggers automated failover even though the VM was not in fact in a failure state. In 2015, automated failover to the cloud is better at monitoring paths and events, but it is still an issue to be aware of. In many cases, immediately alerting an IT team when a VM appears to fail might be a better solution than relying on automation alone.
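One common way to reduce false triggers is to require several independent health checks to agree before failing over automatically, and to alert a human otherwise. The sketch below illustrates that decision logic only. The names, the quorum threshold and the idea of passing in probe results as booleans are assumptions for illustration, not the behavior of any specific DR product.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"          # every probe reports the VM as healthy
    ALERT = "alert"        # notify the IT team and let a person decide
    FAILOVER = "failover"  # enough probes agree, and automation is enabled


def decide(probe_results: list[bool], quorum: int,
           auto_failover: bool = False) -> Action:
    """Choose an action from independent health probes.

    probe_results -- one entry per probe; True means the VM looked healthy
    quorum        -- how many probes must report failure before automated
                     failover is allowed (an assumed policy setting)
    auto_failover -- whether automated failover is enabled at all
    """
    if quorum < 1:
        raise ValueError("quorum must be at least 1")
    failures = sum(1 for healthy in probe_results if not healthy)
    if failures == 0:
        return Action.NONE
    if auto_failover and failures >= quorum:
        return Action.FAILOVER
    # A single failing probe, or automation disabled: alert rather than act.
    return Action.ALERT


if __name__ == "__main__":
    # One of three probes fails: likely a monitoring glitch, so only alert.
    print(decide([False, True, True], quorum=2, auto_failover=True))
    # Two of three probes fail: the quorum is met, so fail over.
    print(decide([False, False, True], quorum=2, auto_failover=True))
```

With automation disabled, the same function always returns an alert on any failure, which corresponds to the alert-first approach described above.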
The Dynamic Cloud
The cloud is a dynamic environment, yet successful failover depends on users being able to find the ported application and its data. One vendor development choice is to use cloud-based clusters as the failover DR site.