by Hari Mankude
A recent Forrester survey indicated that spending on big data platforms will grow at twice the rate of spending on traditional IT categories, with platforms like Hadoop and NoSQL databases seeing even greater rates of investment. Much like the way data management capabilities (backup, test data management, archiving, etc.) appeared over twenty years ago in the traditional database world, these same capabilities are critical for these modern data platforms in order to prevent data loss, deliver applications faster, and minimize compliance risk.
There is a myth that the data replicas present in these big data platforms eliminate the need for backup and other data management capabilities. However, human errors and application corruption are immediately propagated across these replicas, making data loss a real possibility. Further, IT teams come under significant strain as they try to balance storage costs against capacity planning, which affects how companies create optimal secondary storage policies.
These data management capabilities by definition rely on a robust secondary storage environment. But the three Vs of big data (volume, variety and velocity) make these secondary storage requirements very different from those of traditional databases. For example, data stored on Hadoop and in NoSQL databases like Cassandra is typically compressed to reduce the storage footprint, which means companies must decompress the data before applying storage optimization techniques.
A description of four of the most relevant differences follows.
Requirement #1: Scaling Your Secondary Data Management Architecture
Most big data platforms are deployed on commodity x86 hardware or VMs with direct-attached disks and possess a highly elastic architecture where nodes and drives can be added or decommissioned very easily. The sheer volume of the data sets involved in big data workloads means that your secondary data management architecture also needs to scale in order to store petabytes of data with as many restore points as your business dictates. A software-defined architecture provides this type of flexibility, with the ability to deploy on hardware (physical or virtual) of your choice. Typical big data applications grow by simply adding nodes to a cluster. A secondary storage environment needs to grow the same way, by adding commodity nodes to keep pace with the growth of your primary system.
Requirement #2: Storage Reduction and the Need For Application-Awareness
Big data platforms are unique in their structure and employ different types of compression algorithms or compressed columnar formats to store data efficiently. As a result, storage optimization on your secondary storage has to be application-aware to be effective. Take de-duplication as an example. In the big data world, effective de-duplication has to go beyond block-level techniques that just find duplicates at byte-stream granularity. Instead, de-duplication algorithms have to be “smart” enough to understand the “semantic” differences between the data formats of, say, Cassandra keyspaces versus Hive schemas. Using the notion of “data-awareness”, de-duplication algorithms can decompress the data and then apply suitable algorithms so that the redundant replicas can be removed from the data stream. Only then can de-duplication truly impact your secondary storage footprint. The unique data that remains should then be stored in a compressed format.
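To make the point about byte-level versus data-aware de-duplication concrete, here is a minimal Python sketch. It is an illustration only, not how any particular product works: the function names are hypothetical, and gzip and zlib stand in for whatever compression the platform actually uses. The same logical record compressed in different ways produces different byte streams, so a byte-level comparison sees no duplicates, while hashing the decompressed content does.

```python
import gzip
import hashlib
import zlib


def decompress(blob: bytes, codec: str) -> bytes:
    """Return the logical (uncompressed) content of a stored blob."""
    if codec == "gzip":
        return gzip.decompress(blob)
    if codec == "zlib":
        return zlib.decompress(blob)
    raise ValueError(f"unsupported codec: {codec}")


def byte_level_unique(blobs):
    """Naive approach: fingerprint the compressed bytes as stored."""
    return {hashlib.sha256(blob).hexdigest() for blob, _ in blobs}


def data_aware_dedupe(blobs):
    """Decompress first, fingerprint the content, keep one copy per fingerprint.

    The surviving unique content is re-compressed before it is stored.
    """
    store = {}
    for blob, codec in blobs:
        raw = decompress(blob, codec)
        fingerprint = hashlib.sha256(raw).hexdigest()
        if fingerprint not in store:
            store[fingerprint] = zlib.compress(raw)
    return store


if __name__ == "__main__":
    record = b"user=42,event=login\n" * 1000
    # Three replicas of the same record, each compressed differently.
    replicas = [
        (gzip.compress(record), "gzip"),
        (zlib.compress(record), "zlib"),
        (zlib.compress(record, 1), "zlib"),
    ]
    print("byte-level unique blobs:", len(byte_level_unique(replicas)))
    print("data-aware unique blobs:", len(data_aware_dedupe(replicas)))
```

A real implementation would also need to understand record boundaries and on-disk formats (for example, Cassandra SSTables versus columnar files used by Hive), which this sketch deliberately leaves out.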
Requirement #3: Support Storage Target Flexibility
Companies can have multiple secondary storage environments in their mix. Your data management architecture will need to support a storage infrastructure that can span direct-attached commodity storage, NAS and SAN, and potentially even federate older data to cheaper cloud-based storage targets like Amazon S3 or Microsoft Azure blob storage. Again, your data management architecture should not just store data across these different targets but also provide the ability to restore quickly and easily, minimizing the impact of downtime and data loss.
Requirement #4: Storage + Network Costs Make An Incremental-Forever Architecture Critical
A traditional approach to backing up data in a big data environment is not economically or logistically feasible. Take the traditional backup mechanism that combines weekly full backups with daily incrementals. On a 100 TB production big data environment that has a 5% daily change rate, you would move over 550 TB a month. In an incremental-forever architecture, you would do the full 100 TB backup just once, the first time the workflow is run. All subsequent backups would identify just the new modifications (additions, deletions, mutations) and move only the changed data and metadata. The same efficient approach should apply to the recovery process. The traditional approach would involve finding the last full backup and applying all the relevant incrementals to it to create the final image. A far more efficient approach, and one that reduces your Recovery Time Objective (RTO), is to maintain a fully materialized restore image that can be recovered directly.
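The arithmetic behind that comparison can be sketched in a few lines of Python. The assumptions here are mine, not the article's: the 5% change rate is taken to be per day, the month has 30 days with four weekly fulls, and an incremental runs every day. Under those assumptions the traditional schedule moves about 550 TB, in line with the figure quoted above, while a steady-state incremental-forever month moves far less. The function names are hypothetical.

```python
def daily_change_tb(dataset_tb, daily_change_pct):
    """Data changed per day, in TB."""
    return dataset_tb * daily_change_pct / 100


def traditional_monthly_tb(dataset_tb, daily_change_pct, days=30, full_backups=4):
    """Weekly fulls plus daily incrementals over one month."""
    fulls = full_backups * dataset_tb
    incrementals = days * daily_change_tb(dataset_tb, daily_change_pct)
    return fulls + incrementals


def incremental_forever_monthly_tb(dataset_tb, daily_change_pct, days=30):
    """Steady-state month: only changed data moves (the one-time full is excluded)."""
    return days * daily_change_tb(dataset_tb, daily_change_pct)


if __name__ == "__main__":
    print("traditional (TB/month):        ", traditional_monthly_tb(100, 5))
    print("incremental-forever (TB/month):", incremental_forever_monthly_tb(100, 5))
```

The first month of an incremental-forever workflow also carries the one-time 100 TB full backup; every month after that moves only the changed data.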
Big Data Storage Conclusion
Big data platforms are not just here to stay; they are increasingly important in enterprise architectures. As companies realize that protecting these data assets is synonymous with business success, their attention will rightly turn to how best to architect a secondary storage environment that fully supports these data management needs without losing sight of the overall cost of ownership, the flexibility needed to work in diverse architectural environments, and the ability to scale at big data levels.
About the Author:
Hari Mankude, Co-founder and CTO, Talena, Inc.