3. Management. There are a million and one storage management tools out there. The most basic one, still in wide use even in business, believe it or not, is a simple Excel spreadsheet, but vendors from EMC to Hitachi Data Systems to NetApp offer solid storage management solutions. The trouble, though, is that data-sharing standards are still lacking and escaping vendor lock-in is a never-ending challenge.
4. The WAN. As cloud computing becomes mainstream, the simplest way to break down data silos is to leverage the cloud to help with everything from search to backups to raw processing. However, the more storage moves into the cloud, the more the WAN will impede Big Data progress. The WAN, unfortunately, isn't keeping pace with Moore's Law, nor with its storage-specific analog, Kryder's Law. Any Big Data storage solution must include some combination of redundant MPLS links, WAN optimization and CDN services.
5. Security. As you break down data barriers, certain people may get access to data (say, HR records) that they should never, ever see. Thus, authentication, access control and security in general are a major Achilles' heel of Big Data storage.
Since traditional relational (SQL) databases weren't designed with Big Data in mind, a Big Data alternative eventually emerged: Apache Hadoop.
According to the Apache Software Foundation, Hadoop is a “framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
“Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”
Key modules to consider, as far as Big Data storage is concerned, include HDFS, a distributed file system to access application data; the Hive data warehouse infrastructure; and Chukwa, a data collection system for managing large distributed systems.
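The "simple programming models" the Apache description refers to is chiefly MapReduce: data is transformed into key-value pairs (map), grouped by key (shuffle), and aggregated (reduce). The toy, single-process word count below illustrates the model only; real Hadoop distributes each phase across a cluster and would not be written this way.

```python
from collections import defaultdict
from itertools import chain

# Toy, single-process illustration of the MapReduce model that Hadoop
# popularized. In Hadoop, each phase runs distributed across a cluster.

def map_phase(document):
    # Emit (word, 1) pairs, as a Hadoop mapper would.
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    # Group values by key, mimicking Hadoop's shuffle/sort step.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word, as a Hadoop reducer would.
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data storage", "big data processing"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
counts = reduce_phase(shuffle(pairs))
```

Because each phase only ever sees independent keys or independent records, the framework can split the work across thousands of machines and rerun any piece that fails, which is exactly the application-layer fault handling the Apache quote describes.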
Big Data storage is quickly becoming a subset of cloud storage. As data centers are virtualized, and as more data is moved into third-party data centers, Big Data and cloud storage challenges (and opportunities) will begin to merge.
Granted, not all applications will move off-site, nor will every single application be one open to Big Data sharing. However, as security and access rights solutions strengthen, don’t be surprised if nearly every application under the sun is able to share data with nearly every other one – in an ideal world. Of course, standards fights will brew, vendors will do their best to lock customers into their solutions and problems like data loss and IP theft will undermine this ideal world, but it will be feasible from a technical standpoint.
Most enterprises will get their cloud storage feet wet with data backups. Eventually, they will use APIs to connect their on-premises data repositories with cloud and SaaS services, such as Salesforce.com, and cloud storage will evolve into Big Data storage. Further out, as most infrastructure moves into the cloud, various cloud providers will offer an array of Big Data storage options as a service.
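A common first step in API-driven cloud backup is sending only what has changed, since the WAN (see challenge 4 above) is the bottleneck. The sketch below is a hypothetical illustration, not any vendor's API: it compares content hashes against the previous backup run to select files worth uploading; the file names and contents are invented for the example.

```python
import hashlib

# Hypothetical incremental-backup selection: before pushing data to a
# cloud storage API, hash each local file and upload only those whose
# contents changed since the last run. Byte strings stand in for files.
local_files = {
    "hr/records.db": b"version 2 contents",
    "sales/q3.xls": b"unchanged contents",
}

# Content hashes recorded after the previous (hypothetical) backup run.
previously_uploaded = {
    "hr/records.db": hashlib.sha256(b"version 1 contents").hexdigest(),
    "sales/q3.xls": hashlib.sha256(b"unchanged contents").hexdigest(),
}

def files_to_upload(files, uploaded):
    # Select paths whose current content hash differs from the stored one.
    return [path for path, data in files.items()
            if hashlib.sha256(data).hexdigest() != uploaded.get(path)]

changed = files_to_upload(local_files, previously_uploaded)
```

Hashing before transfer trades cheap local CPU for scarce WAN bandwidth, which is why deduplication of this kind shows up in most commercial backup and WAN-optimization products.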
Along the way, flash and SSDs (solid-state drives) may make disk drives obsolete; in-memory storage could break out of its Java purgatory, and biologists working with the human genome may well provide storage insights derived from DNA and gene sequencing.
Note: This is by no means an exhaustive list, and placing vendors in one category or another is largely subjective.
EMC (Key Big Data acquisitions: Greenplum and Isilon)
IBM (Key acquisitions: Cognos, Netezza, OpenPages, Algorithmics, Texas Memory Systems)
Oracle (Key acquisition: Endeca)
HP (Key acquisitions: Autonomy and Vertica)
Cisco (Key acquisition: Truviso)
Dell (Key acquisition: Compellent)