Big Data Protection in the Age of Machine Learning

Machine learning techniques could make data protection more efficient and effective.

The concept of machine learning has been around for decades, primarily in academia. Along the way it has taken various forms and gone by various names, including pattern recognition, artificial intelligence, knowledge management and computational statistics.

Regardless of terminology, machine learning enables computers to learn on their own without being explicitly programmed for specific tasks. Through the use of algorithms, computers are able to read sample input data, build models and make predictions and decisions based on new data. This concept is particularly powerful when the set of input data is highly variable and static programming instructions cannot handle such scenarios.

In recent years, the proliferation of digital information through social media, the Internet of Things (IoT) and e-commerce, combined with access to affordable compute power, has moved machine learning into the mainstream. Machine learning is now commonly used across industries including finance, retail, healthcare and automotive. Inefficient tasks once performed using human input or static programs have now been replaced by machine learning algorithms.

Here are a few examples:

Fraud Detection

Prior to the use of machine learning, fraud detection involved following a set of complex rules and a checklist of risk factors to detect potential security threats. But with the growth in the volume of transactions and the number of security threats, this method of fraud detection did not scale. The finance industry is now using machine learning to identify unusual activity and anomalies and report them to security teams. PayPal, for example, uses machine learning to compare millions of transactions and identify fraudulent and money laundering activity.

Recommendation Engines

Without machine learning, recommendations on which products to buy and which movies to watch spread mainly by word of mouth. Companies like Amazon and Netflix changed that by adopting machine learning to make recommendations to their customers based on data they had collected from other, similar users. Using machine learning to recommend movies and products is now fairly common. Machine learning algorithms analyze your profile and activity against the millions of other users in their database and recommend products that you are likely to buy or movies that you may be interested in watching.

Machine Learning Meets Big Data Protection

For all its increased popularity and use, machine learning has not yet made its way into any part of data protection, and that absence is being acutely felt in big data. Specifically, backup and recovery for NoSQL databases (Cassandra, Couchbase, etc.), Hadoop, and emerging data warehouse technologies (HPE Vertica, Impala, Tez, etc.) is a very manual process that requires a lot of human interaction and input. It is quite a paradox that these big data platforms are used for machine learning while the underlying data protection processes supporting them rely on human intervention and input.

For example, an organization may have a defined recovery point objective (RPO) and recovery time objective (RTO) for a big data application. Based on those objectives, an IT or DevOps engineer determines the schedule and frequency for backing up application data. If the RPO is 24 hours, the engineer may decide to perform backups once per day starting at 11:00 p.m.

While this makes sense logically, the answer is not that simple, especially in a big data environment. Big data environments are often very dynamic and unpredictable. These systems may be unusually busy at 11:00 p.m., loading new data or running nightly reports, which makes that time far from optimal for scheduling a backup.

Why can’t the data protection application recommend the best time to schedule a backup task to meet the recovery point objective?
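As a rough illustration of what such a recommendation could look like, the sketch below picks a daily backup start hour from historical load measurements. It is a simple averaging heuristic standing in for a learned model, not a description of any product. The function name `recommend_backup_hour`, the `(hour_of_day, load)` sample format and the choice to treat unobserved hours as fully busy are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def recommend_backup_hour(samples, backup_hours=1):
    """Return the start hour (0-23) whose backup window was historically quietest.

    samples: iterable of (hour_of_day, load) pairs, where load is any
             "busyness" measure on a 0..1 scale (hypothetical input format).
    backup_hours: how many consecutive hours the backup is expected to run.
    """
    by_hour = defaultdict(list)
    for hour, load in samples:
        by_hour[hour % 24].append(load)

    # Average load per hour of day. Hours with no observations are treated
    # as fully busy (1.0), which is a conservative assumption.
    avg = [mean(by_hour[h]) if by_hour[h] else 1.0 for h in range(24)]

    def window_load(start):
        # Total average load over the window, wrapping past midnight.
        return sum(avg[(start + i) % 24] for i in range(backup_hours))

    return min(range(24), key=window_load)


if __name__ == "__main__":
    # One week of made-up history: busy at 11 p.m., quiet around 3-4 a.m.
    history = []
    for day in range(7):
        for hour in range(24):
            load = {23: 0.9, 3: 0.1, 4: 0.2}.get(hour, 0.5)
            history.append((hour, load))
    print(recommend_backup_hour(history, backup_hours=2))
```

A once-a-day schedule at the recommended hour would still satisfy the 24-hour RPO from the example above, while avoiding the hours when the cluster is usually busiest.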

Another common example of inefficiency in data protection relates to storing backup data. Typically, techniques such as compression and de-duplication are applied to backup data to reduce the backup storage footprint. The algorithms used for these techniques are static and follow the same mechanism independent of the type of data being dealt with. Given that big data platforms use many different compressed and uncompressed file formats (Record Columnar (RC), Optimized Row Columnar (ORC), Parquet, Avro, etc.), a static algorithm for deduplication and compression does not yield the best results.

Why can’t the data management application learn and adopt the best deduplication and compression techniques for each of the file formats?
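One way to picture this is a selector that tries several compressors on sample data and remembers which one worked best for each file format. The sketch below covers compression only, not deduplication, and uses general-purpose codecs from the Python standard library as stand-ins. The `CodecSelector` class and its methods are hypothetical names for the example, and the format labels are plain strings supplied by the caller.

```python
import bz2
import lzma
import zlib
from collections import defaultdict

# Candidate compressors: standard-library codecs used purely as stand-ins.
CODECS = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


class CodecSelector:
    """Tracks observed compression results per (file format, codec)."""

    def __init__(self):
        # (file_format, codec_name) -> [original_bytes, compressed_bytes]
        self._totals = defaultdict(lambda: [0, 0])

    def observe(self, file_format, sample):
        """Compress a sample with every codec and record the sizes."""
        for name, compress in CODECS.items():
            totals = self._totals[(file_format, name)]
            totals[0] += len(sample)
            totals[1] += len(compress(sample))

    def best_codec(self, file_format):
        """Return the codec with the lowest compressed/original ratio so far,
        or None if nothing has been observed for this format."""
        ratios = {
            name: compressed / original
            for (fmt, name), (original, compressed) in self._totals.items()
            if fmt == file_format and original > 0
        }
        if not ratios:
            return None
        return min(ratios, key=ratios.get)


if __name__ == "__main__":
    selector = CodecSelector()
    selector.observe("avro", b"user_id,event,timestamp\n" * 2000)
    print(selector.best_codec("avro"))
    print(selector.best_codec("orc"))  # None: no samples seen yet
```

A real system would also need to weigh compression speed and CPU cost, and formats such as ORC or Parquet that are already compressed internally may gain little from any of these codecs.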

Machine learning certainly could aid in optimizing a company’s data protection processes for big data. All pertinent data needs to be collected and analyzed dynamically using machine learning algorithms. Only then will we be able to do efficient, machine-driven data protection for big data. The question is not if but when!

By Jay Desai, VP, product management, Talena, Inc.

Photo courtesy of Shutterstock.

Tags: data protection, big data, machine learning
