As organizations and their data increasingly move to the cloud and to large-scale distributed databases, one particular challenge that has emerged is how to actually back up and recover all that data. With a distributed cloud-scale database, such as MongoDB or Cassandra, data is made highly available across multiple nodes, which is great for data resiliency in the cloud but makes backup somewhat challenging.
Tarun Thakur, co-founder and CEO of startup Datos IO, is aiming to solve the cloud distributed database challenge with his company's RecoverX platform. The primary innovation of RecoverX is what Thakur referred to as Consistent Orchestrated Distributed Recovery (CODR), which can back up an organization's data from a distributed cloud database. The initial generally available iteration of RecoverX supports Apache Cassandra (v2.0, v2.1), DataStax DSE (v4.5, v4.6, v4.7, v4.8) and MongoDB (v3.0, v3.2).
Datos IO first exited stealth mode in September 2015, announcing that it had raised $12.5 million in a Series A round of funding that included the participation of Lightspeed Venture Partners and True Ventures.
Both MongoDB and Cassandra include elements that enable high availability and even data redundancy, though Thakur noted that neither has, by default, features for true point-in-time backup.
"With RecoverX we enable a true point-in-time backup that is cluster consistent," Thakur said.
With a multi-node database, where data is distributed widely, getting a consistent point-in-time image of the database requires a technique known in the industry as distributed consensus. Specifically, Datos IO is using the Raft distributed consensus model, which was originally developed by Salesforce lead software engineer Diego Ongaro and is widely used in cloud systems today, including Google Kubernetes.
"MongoDB and Cassandra enables a masterless distributed data architecture," Thakur explained. "As such, the data versioning has to be cluster level."
That cluster-level data versioning for backup also applies to de-duplication of data, so that a backup does not hold multiple versions of the same data. To that end, RecoverX enables semantic de-duplication, which takes the data from multiple nodes of a distributed cloud-scale database and creates a single golden backup. Thakur said that RecoverX semantic de-duplication is able to reduce backup storage space requirements by approximately 70 percent. He added that once an initial backup is done, RecoverX continues to deliver storage savings by doing incremental backups of only the data that has changed in a cluster.
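As a rough illustration of the idea, and not a description of how RecoverX is actually built, the Python sketch below collapses replicated rows from several nodes into one copy per key and then computes what an incremental backup would need to store. The function names, the row layout and the rule of keeping the row with the newest write timestamp are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical row layout: (key, write_timestamp, value).
Row = Tuple[str, int, str]
# Hypothetical backup layout: key -> (write_timestamp, value).
Backup = Dict[str, Tuple[int, str]]


def merge_replicas(node_dumps: Iterable[List[Row]]) -> Backup:
    """Collapse rows dumped from several nodes into one copy per key.

    Assumption for this sketch: when replicas disagree, the row with
    the newest write timestamp wins.
    """
    golden: Backup = {}
    for dump in node_dumps:
        for key, ts, value in dump:
            current = golden.get(key)
            if current is None or ts > current[0]:
                golden[key] = (ts, value)
    return golden


def incremental(previous: Backup, current: Backup) -> Backup:
    """Return only the entries that are new or changed since `previous`."""
    return {k: v for k, v in current.items() if previous.get(k) != v}


if __name__ == "__main__":
    # Three nodes holding overlapping replicas of the same rows.
    node_a = [("user:1", 10, "alice"), ("user:2", 12, "bob")]
    node_b = [("user:1", 10, "alice"), ("user:2", 15, "bobby")]
    node_c = [("user:2", 12, "bob"), ("user:3", 11, "carol")]

    golden = merge_replicas([node_a, node_b, node_c])
    print(len(golden), "unique rows kept out of 6 replicated rows")
```

In this toy example, six replicated rows shrink to three in the merged copy, and a later run that sees one new row would store just that row. A real system would also have to handle deletes, schema details and consistency across the cluster, which this sketch ignores.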
From a storage technology perspective, Thakur explained that RecoverX enables software-defined storage, whereby the organization chooses where it wants the data to be stored. Options include cloud storage platforms such as Amazon S3 and Google Cloud, as well as traditional NFS (Network File System) based storage systems.
Sean Michael Kerner is a senior editor at Datamation and InternetNews.com. Follow him on Twitter @TechJournalist