by Hari Mankude
A recent Forrester survey indicated that spending on big data platforms will grow at twice the rate of spending on traditional IT categories, with platforms like Hadoop and NoSQL databases seeing even greater rates of investment. Just as data management capabilities (backup, test data management, archiving, etc.) emerged over twenty years ago in the traditional database world, the same capabilities are critical for modern data platforms to prevent data loss, deliver applications faster, and minimize compliance risk.
A common myth holds that the data replicas built into these big data platforms eliminate the need for backup and other data management capabilities. In reality, human error and application corruption propagate immediately across those replicas, making data loss a real possibility. Replication also puts significant strain on IT resources, which must balance storage costs against capacity planning as they design optimal secondary storage policies.
These data management capabilities by definition rely on a robust secondary storage environment. But the three Vs of big data – volume, variety and velocity – make these secondary storage requirements much different from those of traditional databases. For example, data stored on Hadoop and in NoSQL databases like Cassandra is typically compressed to reduce the storage footprint, creating an extra burden on companies to decompress the data before applying storage optimization techniques.
The four most relevant requirements are described below.
Requirement #1: Scaling Your Secondary Data Management Architecture
Most big data platforms are deployed on commodity x86 hardware or VMs with direct-attached disks and possess a highly elastic architecture in which nodes and drives can be added or decommissioned easily. The sheer volume of the data sets involved in big data workloads means that your secondary data management architecture also needs to scale to store petabytes of data with as many restore points as your business dictates. A software-defined architecture provides this flexibility, with the ability to deploy on physical or virtual hardware of your choice. Typical big data applications grow simply by adding nodes to a cluster; a secondary storage environment needs to grow the same way, by adding commodity nodes that scale to handle the growth of your primary system.
Requirement #2: Storage Reduction and the Need For Application-Awareness
Big data platforms are unique in their structure and employ different compression algorithms or compressed columnar formats to store data efficiently. As a result, storage optimization on your secondary storage has to be application-aware to be effective. Take de-duplication as an example. In the big data world, effective de-duplication has to go beyond block-level techniques that simply find duplicates at byte-stream granularity. Instead, de-duplication algorithms have to be “smart” enough to understand the semantic differences between data formats, say, Cassandra keyspaces versus Hive schemas. With this data-awareness, de-duplication can first decompress the data and then apply format-appropriate algorithms so that all duplicates are removed from the data stream. Only then can de-duplication truly reduce your secondary storage footprint. The unique data should then be stored in compressed form.
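As a rough illustration (a minimal sketch, not any vendor's actual implementation), the decompress-then-fingerprint idea can be shown in a few lines: two replicas compressed with different settings produce different byte streams, so block-level de-duplication of the compressed blobs finds nothing, while content-level de-duplication after decompression removes every duplicate chunk. All names and parameters here are illustrative.

```python
# Sketch of application-aware de-duplication: decompress first, then
# fingerprint content chunks, so replicas that were compressed differently
# still de-duplicate. Chunk size and function names are illustrative.
import gzip
import hashlib

def dedupe_chunks(compressed_blobs, chunk_size=4096):
    """Decompress each blob, chunk it, and keep only unique chunks."""
    store = {}                        # fingerprint -> recompressed chunk
    for blob in compressed_blobs:
        data = gzip.decompress(blob)  # undo the platform's compression
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in store:
                # store the unique data in compressed form, per the article
                store[fp] = gzip.compress(chunk)
    return store

# Two replicas of the same records, compressed with different settings:
# the byte streams differ, but the content does not.
record = b"user:42,event:login\n" * 500
replica_a = gzip.compress(record)                    # default level
replica_b = gzip.compress(record, compresslevel=1)   # fastest level
unique = dedupe_chunks([replica_a, replica_b])
```

Block-level de-duplication comparing `replica_a` and `replica_b` directly would find no duplicates at all, since the compressed byte streams differ; after decompression, the second replica contributes no new chunks.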
Requirement #3: Storage Target Flexibility
Companies can have multiple secondary storage environments in their mix. Your data management architecture needs to support direct-attached commodity storage, NAS, and SAN, and potentially federate older data to cheaper cloud-based storage targets like Amazon S3 or Microsoft Azure Blob storage. Again, your data management architecture should not just store data across these different targets but also provide the ability to restore quickly and easily, minimizing the impact of downtime and data loss.
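One common way to get this flexibility is a pluggable storage-target interface: the data management layer writes through a single abstraction, and backends for direct-attached disk, NAS/SAN mounts, or cloud object stores can be swapped in. The sketch below assumes this design; the class and method names are hypothetical, not from any particular product.

```python
# Hypothetical pluggable storage-target abstraction. A cloud backend
# (e.g. wrapping an S3 or Azure Blob SDK) would implement the same
# interface, letting older restore points be federated to cheaper tiers.
import os
from abc import ABC, abstractmethod

class StorageTarget(ABC):
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

class LocalDiskTarget(StorageTarget):
    """Direct-attached or NAS/SAN storage exposed as a filesystem path."""
    def __init__(self, root: str):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def put(self, key: str, data: bytes) -> None:
        with open(os.path.join(self.root, key), "wb") as f:
            f.write(data)

    def get(self, key: str) -> bytes:
        with open(os.path.join(self.root, key), "rb") as f:
            return f.read()

# The backup workflow only sees the interface, not the backend.
target: StorageTarget = LocalDiskTarget("/tmp/backup-demo")
target.put("restore-point-001", b"backup bytes")
```

Because restores go through the same interface, recovery works identically whether the restore point lives on local disk or in a cloud tier.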
Requirement #4: Storage + Network Costs Make An Incremental-Forever Architecture Critical
A traditional approach to backing up data in a big data environment is not economically or logistically feasible. Take the traditional backup mechanism that incorporates weekly full backups with daily incrementals. On a 100 TB production big data environment that has a 5% change rate, you would move over 550 TB a month. In an incremental-forever architecture, you would do the full 100 TB backup just once, the first time the workflow is run. All the subsequent backups would identify just the new modifications (additions, deletions, mutations) and move the changed data and metadata. The same efficient approach should apply to the recovery process. The traditional approach would involve finding the last full backup and applying all the relevant incrementals to it to create the final image. Instead, a far more efficient approach to reduce your Recovery Time Objective (RTO) would be to have a fully materialized restore image that can be directly recovered.
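The arithmetic behind the 550 TB figure can be sketched as a back-of-the-envelope calculation, assuming weekly full backups and a 30-day month (the exact monthly total depends on those assumptions):

```python
# Monthly data movement: traditional weekly-full/daily-incremental backup
# versus incremental-forever, using the article's figures of a 100 TB
# production environment with a 5% daily change rate.
PRIMARY_TB = 100             # size of the production big data environment
CHANGE_RATE = 0.05           # fraction of the data that changes per day
DAYS = 30                    # assume a 30-day month
FULLS_PER_MONTH = DAYS / 7   # assume one full backup per week

# Traditional: weekly fulls plus incrementals on the remaining days.
traditional = (FULLS_PER_MONTH * PRIMARY_TB
               + (DAYS - FULLS_PER_MONTH) * PRIMARY_TB * CHANGE_RATE)

# Incremental-forever: one full ever (counted here in the first month),
# then only the changed data and metadata each subsequent day.
incremental_forever = PRIMARY_TB + (DAYS - 1) * PRIMARY_TB * CHANGE_RATE

print(f"traditional:         {traditional:.0f} TB moved per month")
print(f"incremental-forever: {incremental_forever:.0f} TB in the first month")
```

Under these assumptions the traditional scheme moves roughly 557 TB a month indefinitely, while incremental-forever moves about 245 TB in the first month and only about 150 TB in each month after that.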
Big Data Storage Conclusion
Big data platforms are not just here to stay, they are increasingly important in enterprise architectures. As companies realize that protecting these data assets is synonymous with business success, their attention will rightly turn to how best to architect a secondary storage environment that fully supports these data management needs without losing sight of the overall cost of ownership, the flexibility needed to work in diverse architectural environments, and the ability to scale at big data levels.
About the Author:
Hari Mankude, Co-founder and CTO, Talena, Inc.