
Are Big Data Vendors Forgetting History?


With any new hot trend comes a truckload of missteps, bad ideas and outright failures. I should probably create a template for this sort of article, one in which I could pull out a term like “cloud” or “BYOD” and simply plug in “social media” or “Big Data.”

When the trend in question either falls by the wayside or passes into the mainstream, it seems like we all forget the lessons faster than PR firms create new buzzwords.

Of course, vendors within trendy news spaces also tend to think they’re in uncharted waters. In fact, there’s plenty of history to learn from. Cloud concepts have been around at least since the 1960s (check out Douglas Parkhill’s 1966 book, The Challenge of the Computer Utility, if you don’t believe me), but plenty of cloud startups ignored history in favor of buzz.

And it’s not like gaining insights from piles of data is some new thing that was previously as rare as detecting neutrinos from deep space.

Here are five history lessons we should have already learned, but seem to be doomed to keep repeating:

1. Small project failures portend the failure of the whole sector.

It wasn’t that long ago that every time a cloud project or company failed, some tech prognosticator would sift through the tea leaves and claim that the cloud concept itself was dead.

The same thing is happening with Big Data. According to a recent survey, 55 percent of Big Data projects are never even completed. It’s hard to achieve success if you don’t even finish what you started, yet many mistakenly believe that this means Big Data is bunk.

Not true. Plenty of companies are reaping the rewards of Big Data, analyzing piles of data to improve everything from marketing and sales to fraud detection.

“It reminds me of the Moneyball craze during the early 2000s, when Major League Baseball teams started to figure out that statistics could be used to build a winning ball club, rather than relying on a scout’s stopwatch and gut,” noted Matt Fates, a partner with Ascent Venture Partners. “There was initial backlash against the ‘stat geeks,’ but today every team has an advanced statistics department that helps general managers make better decisions. This was bringing data, and insights, to bear on decisions in a way that turned conventional wisdom on its head. It was not ‘big data’, but it led to big changes. It never would have started had one GM not been open-minded about statistics. His success forced others to follow.”

Of course, some of the confusion stems from how indiscriminately the term Big Data is thrown around. Most of us don’t need Big Data per se, just data analytics, which leads to the second history lesson everyone is failing to recall:

2. Imprecise terminology can poison the trend well.

People mean many different things when they use terms such as “cloud” and “Big Data.” Are you talking about virtualized infrastructures when you say cloud? Private clouds? AWS? Similarly, Big Data can refer to existing pools of data, data analytics, machine learning, and on and on.

The Big Mistake with Big Data is that many use the term to mask vague objectives, fuzzy strategies and ill-defined goals.

Often when people use these terms loosely, it’s because they don’t really know what the terms mean in general, let alone what they mean for their particular business problems. As a result, vendors are asked for proposals that are a poor fit for an organization’s cloud or Big Data challenges.

If your CEO or CIO orders you to start investigating Big Data, your first question needs to be the most basic one: Why, specifically?

If you can’t answer that question concisely, you’re in trouble.

3. Getting sidetracked by nitpicky technical details.

If you’re the person tasked with building out a Big Data architecture, then it’s fine to focus on details that won’t matter to anyone who isn’t a data scientist.

If you’re a business user or non-data scientist, it’s best to just ignore all this noise. It’ll sort itself out soon enough. I’ve seen this phenomenon repeat with everything from CDNs to storage to cloud computing and now Big Data. Engineers and product developers often fall prey to “if we build it, they will come” syndrome, ignoring the real-world pain points of potential customers in favor of hyping their technical chops.

When they fail to find real-world customers for the resulting products, they turn their attention to technical minutiae, since it couldn’t possibly be a flawed go-to-market strategy that was the problem in the first place.

Take the recent news that Facebook is making its distributed SQL query engine, Presto, open source. Is this a win for Hadoop or for SQL? Does it mark the end of Hive?

Who cares?

Okay, if you’re reading this, you’re probably an early adopter or you’ve already placed some Big Data bets, so it matters to you. But for the rest of the world, it’s not even on their radar – nor should it be.

Ryan Betts, CTO of VoltDB, a NewSQL database vendor, does care, but even he, as deeply engrossed in the minute details as anyone, recognizes that the real point of Big Data is far less granular: “Data is only valuable when you can interact with it. Data you can’t interact with? That’s just overhead. Access and interactivity need to come first.”

4. But sometimes the techy details do matter.

For every rule, there’s the exception that proves it, and here’s one: SQL vs. NoSQL is a fight that will have real-world ramifications. Hadoop and NoSQL startups have been getting a lot of attention lately, with the likes of Cloudera, 10gen (the company behind MongoDB) and Datameer raising significant VC funding.

However, tech giants seem to be betting against NoSQL. “As SQL relational systems first came to market, many years ago, they competed with navigation and document oriented solutions. SQL won,” Betts pointed out. “The expressiveness and the flexibility to interact with data is why SQL matters. SQL is fast. SQL scales (witness Impala, BigQuery and Facebook’s announcement today). SQL matters to the marketplace – ask any ODBC-compliant BI vendor. To date first Google, then Apache Impala, and now Facebook have announced SQL interfaces to their large volume data stores. It’s nice to see ‘NoSQL’ learning the lessons of 30 years.”

Of course, the NoSQL camp has its own arguments for why their approach is better, but the smart money looks like it’s heading in the opposite direction – for now.

5. Being a skeptic is easy, but Big Data matters.

I recently attended a panel on Big Data where one of the panelists made some sarcastic comments about Big Data not being real, since it’s typically either capitalized or in quotes – or both.

I get the joke (although it’s not a terribly funny one), but many people take these jokes seriously.

Not too long ago, there was plenty of cloud skepticism, even from people who should have known better (such as Larry Ellison, until he saw the light and hastily directed Oracle to play cloud catch-up). Now I hear plenty of Big Data skepticism, most of which either stems from ignorance or from an urge to protect the status quo.

Granted, some of the skepticism is well-earned, since vendors in hot spaces tend to hype the crap out of even something as trivial as a UI upgrade, but Big Data is here to stay, and it’s making an impact already.

Recently, Cryptolocker, which is arguably the most effective and sophisticated piece of ransomware released to date, was kept in check through Big Data analytics.

Cryptolocker’s creators built a Domain Generation Algorithm that produces thousands of different rendezvous domains for the malware to try until it finally finds a command-and-control server. This tactic helps the malware evade static blacklists and reputation systems. Upon infecting a device, Cryptolocker must establish a connection with a command-and-control server to obtain an infection-specific encryption key; without that key it can’t encrypt the victim’s files, and without encrypted files there’s no ransom for the attacker to collect later.
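To make the mechanism concrete, here is a minimal sketch of how a date-seeded DGA works in general. The hashing scheme, TLD list and domain count below are illustrative assumptions, not Cryptolocker’s actual algorithm; the point is simply that the malware and anyone who reverse-engineers it can derive the same candidate domains from the same seed.

```python
import hashlib
from datetime import date, timedelta

# Illustrative date-seeded domain generation algorithm (DGA).
# The hash, TLD list, and count are hypothetical examples, NOT
# Cryptolocker's real algorithm.

TLDS = ["com", "net", "org", "biz", "info"]

def generate_domains(day: date, count: int = 1000) -> list[str]:
    """Derive `count` pseudo-random rendezvous domains from a date."""
    domains = []
    for i in range(count):
        seed = f"{day.isoformat()}-{i}".encode()
        digest = hashlib.md5(seed).hexdigest()
        # Use part of the hash as a hostname and rotate through TLDs.
        domains.append(f"{digest[:16]}.{TLDS[i % len(TLDS)]}")
    return domains

if __name__ == "__main__":
    # The malware tries each candidate until one resolves to a live
    # command-and-control server; a defender who knows the algorithm
    # can precompute tomorrow's list and block it in advance.
    print(generate_domains(date.today(), count=5))
    print(generate_domains(date.today() + timedelta(days=1), count=5))
```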

Using traditional detection and prevention methods, “it took about 30 days for security vendors to capture malware samples and reverse engineer them to come up with a way to contain it,” said Dan Hubbard, CTO of cyber-security service provider OpenDNS.

OpenDNS took a different approach to fend off Cryptolocker. Using Big Data analytics and predictive algorithms, OpenDNS’ Umbrella security service was able to block Cryptolocker from day one of the outbreak. The service identifies the patterns used by Cryptolocker’s Domain Generation Algorithm and predicts the malicious sites it tries to connect with. Because the OpenDNS Umbrella service monitors inbound and outbound Internet traffic, it can block outbound Cryptolocker traffic and prevent infected machines from having their data encrypted.
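OpenDNS hasn’t published the exact models behind that blocking, but one widely used signal for spotting DGA traffic is that machine-generated domains look statistically random compared with names humans register. Here is a minimal sketch of that idea using Shannon entropy over the leftmost label; the threshold and length cutoff are made-up example values, not OpenDNS’s actual scoring.

```python
import math
from collections import Counter

# Illustrative heuristic: flag DNS queries whose labels look statistically
# random, a common signal for algorithmically generated (DGA) domains.
# The threshold values are example assumptions, not a production model.

def shannon_entropy(label: str) -> float:
    """Bits of entropy per character in a domain label."""
    counts = Counter(label)
    total = len(label)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def looks_generated(domain: str, threshold: float = 3.5) -> bool:
    """Heuristic: long, high-entropy leftmost labels are suspicious."""
    label = domain.split(".")[0]
    return len(label) >= 12 and shannon_entropy(label) >= threshold

if __name__ == "__main__":
    queries = ["www.example.com", "mail.google.com",
               "qx7f9k2mzt4rlbwe.biz", "a1c4e8f0b2d6a9c3.info"]
    for q in queries:
        verdict = "block" if looks_generated(q) else "allow"
        print(f"{q:28s} -> {verdict}")
```

In practice a service at OpenDNS’s scale would combine signals like this with query volume, domain age and co-occurrence patterns across its whole traffic stream, which is where the “Big Data” part of the story actually comes in.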

“With Big Data-powered predictive security we are able to cut off the head of Cryptolocker and then pinpoint infected machines for disinfection,” Hubbard said.

That’s one security lesson that will, hopefully, not be lost to history.

Jeff Vance is a regular contributor to many high-tech and business-focused publications, including Forbes.com, Wired, Network World, CIO, Datamation and many others. Connect with him on LinkedIn (jeffvanceatsandstorm), follow him on Twitter @JWVance, or add him to your circles on Google Plus (+jeffvance).

