Hadoop and Big Data analytics are popular topics, perhaps only overshadowed by security talk. Apache’s Hadoop and its other 15 related Big Data projects are enterprise-class and enterprise-ready. Yes, they’re open source and yes, they’re free, but that doesn’t mean that they’re not worthy of your attention. For businesses that want commercial support, here are 15 companies ready to serve you and your Hadoop needs.
This list of Hadoop/Big Data vendors is in alphabetical order.
1. Amazon EMR
Key differentiators: Integration with Amazon’s Elastic Compute Cloud (EC2), S3, and DynamoDB, plus an inexpensive and flexible pay-as-you-use plan. An added bonus is that EMR plays nicely with Apache Spark and the Presto distributed SQL query engine.
Amazon Elastic MapReduce (Amazon EMR), part of Amazon Web Services (AWS), is a web service that allows you to manage your big data sets. EMR promises to securely and reliably handle your big data, log analysis, web indexing, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics.
Amazon’s pricing model is simple. With flat per-hour rates, you can accurately predict your monthly fees, which makes it easy to plan next year’s budget. Since Amazon’s cloud computing prices keep falling, your costs shrink while your revenues pile up. Per-hour prices range from $0.011 to $0.27 (roughly $96/year to $2,365/year for an instance that runs around the clock), depending on the size of the instance you select and on the Hadoop distribution.
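To see how an hourly rate turns into a yearly figure, here is a minimal sketch. It assumes an instance that runs nonstop for a 365-day year (8,760 hours); the function name `annual_cost` is illustrative and not part of any AWS tool, and real bills depend on actual usage.

```python
# Assumption: the instance runs around the clock for a non-leap year.
HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def annual_cost(hourly_rate: float, hours: int = HOURS_PER_YEAR) -> float:
    """Return the yearly cost in dollars for a given per-hour rate."""
    return round(hourly_rate * hours, 2)


if __name__ == "__main__":
    # The two rates are the low and high ends of the range quoted in the article.
    for rate in (0.011, 0.27):
        print(f"${rate}/hour -> ${annual_cost(rate):,.2f}/year")
```

Under these assumptions the two ends of the range come to about $96 and $2,365 a year. Small differences from published annual figures can come from rounding or from a different assumed number of hours.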
The downside of Amazon’s services is that they’re somewhat difficult to use. They’re easier to use now than they were a few years ago, but to use AWS and its associated services, you need intermediate-level system administration skills to understand all of the options and to handle key pairs and permissions.
2. Attunity Replicate
Key differentiators: Attunity automates data transfer into Hadoop from any source and also automates data transfers out of Hadoop, including both structured and unstructured data. Attunity has forged strategic partnerships with Cloudera and Hortonworks (both included in this article).
It’s hard to pinpoint exactly what Attunity Replicate does for big data until you see the process in action. Replicate takes data from one platform and translates it into another. For example, if you have multiple data sources and want to combine them all into a single data set, then you’d have to struggle with grabbing or dumping the data from all your source platforms and transforming that data into your desired target platform. You might have sources from Oracle, MySQL, IBM DB2, and SQL Server and your target is MySQL.
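To make the manual consolidation chore concrete, here is a small sketch of copying the same table from several sources into one target by hand. It is not Attunity’s API: in-memory SQLite databases stand in for the real source platforms, and the table layout, source names, and helper functions are all hypothetical.

```python
import sqlite3


def make_source(rows):
    """Create an in-memory database standing in for one source platform."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)
    conn.commit()
    return conn


def consolidate(sources):
    """Copy every source's customers table into one target table,
    tagging each row with the name of the source it came from."""
    target = sqlite3.connect(":memory:")
    target.execute(
        "CREATE TABLE customers (source TEXT, id INTEGER, name TEXT)"
    )
    for source_name, conn in sources.items():
        rows = conn.execute("SELECT id, name FROM customers").fetchall()
        target.executemany(
            "INSERT INTO customers VALUES (?, ?, ?)",
            [(source_name, row_id, name) for row_id, name in rows],
        )
    target.commit()
    return target


# Hypothetical stand-ins for two of the source platforms named above.
sources = {
    "oracle_stand_in": make_source([(1, "Ada"), (2, "Grace")]),
    "db2_stand_in": make_source([(1, "Linus")]),
}
target = consolidate(sources)
total = target.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(total)
```

Even this toy version needs a schema decision (the extra `source` column) and a loop per platform. With real databases you would also have to handle differing data types and drivers, which is the work a replication tool is meant to take off your hands.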
Attunity’s Click2Replicate allows you to graphically select your source, graphically select your target and then click to replicate the data. You can filter the data by table or other criteria, but the process is simple and you don’t have to worry about the transformation process.
Attunity supports a wide range of sources and targets, but check closely before you purchase because not all databases are both source and target capable.
3. Cloudera CDH
Key differentiators: CDH is a distribution of Apache Hadoop and related products. It is Apache-licensed, open source, and is the only Hadoop solution that offers unified batch processing, interactive SQL, interactive search, and role-based access controls.
Cloudera claims that enterprises have downloaded CDH more than all other distributions combined. CDH offers the standard Hadoop features but adds its own user interface (Hue), enterprise-level security, and integration with more than 300 vendor products and services.
Cloudera offers multiple options for getting started with Hadoop, including an Express version, an Enterprise version, a Director (cloud) version, four Cloudera Live options, and a Cloudera demo. If you want to test in your own environment, you can also download the Cloudera QuickStart VM.
4. Datameer Professional
Key differentiators: The first big data analytics platform for Hadoop-as-a-Service designed for department-specific requirements.
Datameer Professional allows you to ingest, analyze, and visualize terabytes of structured and unstructured data from more than 60 different sources including social media, mobile data, web, machine data, marketing information, CRM data, demographics, and databases to name a few. Datameer also offers you 270 pre-built analytic functions to combine and analyze your unstructured and structured data after ingest.
Datameer focuses on big data analytics in a single application built on top of Hadoop. Datameer features a wizard-based data integration tool, iterative point-and-click analytics, drag-and-drop visualizations, and scales from a single workstation up to thousands of nodes. Datameer is available for all major Hadoop distributions.
5. DataStax Enterprise Analytics
Key differentiators: DataStax uses Apache Cassandra as its database engine and Apache Hadoop as its analytics platform, a combination that is highly scalable, fast, and capable of real-time and streaming analytics.
DataStax delivers powerful integrated analytics to 20 of the Fortune 100 companies and to well-known companies such as eBay and Netflix. DataStax is built on open source software technology for its primary services: Apache Hadoop (analytics), Apache Cassandra (NoSQL distributed database), and Apache Solr (enterprise search).
DataStax made the choice to use Apache Cassandra, which provides an “always-on” capability for DataStax Enterprise (DSE) Analytics. DataStax OpsCenter also offers a web-based visual management system for DSE that allows cluster management, point-and-click provisioning and administration, secured administration, smart data protection, and visual monitoring and tuning.
6. Dell Statistica Big Data Analytics
Key differentiators: Dell’s recently purchased Statistica Big Data Analytics platform features natural language processing, entity extraction, interactive visualizations and dashboards, databases, database appliances, and distributed advanced analytic models across Hadoop.
Dell’s Statistica Big Data Analytics is an integrated, configurable, cloud-enabled software platform that you can easily deploy in minutes. You can harvest sentiments from social media and the web and combine that data to better understand market traction and trends. Dell leverages Hadoop, Lucene/Solr search, and Mahout machine learning to bring you a highly scalable analytic solution running on Dell PowerEdge servers.
Dell summarizes the hardware and software requirements for your Hadoop cluster simply: 2 to 100 Linux servers, each with 6GB of RAM, two or more cores, and a 1TB HDD. The point is that entry into a Hadoop solution is simple and inexpensive. And as Dell puts it, “Gain robust big data analytics on an open and easily deployed platform.”
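The requirements quoted above are simple enough to express as a quick check. The sketch below treats the published figures as per-server minimums, which is an assumption on my part; the `Server` class and `meets_requirements` function are hypothetical names, not part of any Dell tooling.

```python
from dataclasses import dataclass


@dataclass
class Server:
    ram_gb: int
    cores: int
    disk_tb: float


# Figures from the article, read here as per-server minimums (an assumption).
MIN_SERVERS, MAX_SERVERS = 2, 100
MIN_RAM_GB, MIN_CORES, MIN_DISK_TB = 6, 2, 1.0


def meets_requirements(cluster):
    """Return True if the cluster size and every server meet the minimums."""
    if not MIN_SERVERS <= len(cluster) <= MAX_SERVERS:
        return False
    return all(
        s.ram_gb >= MIN_RAM_GB
        and s.cores >= MIN_CORES
        and s.disk_tb >= MIN_DISK_TB
        for s in cluster
    )


if __name__ == "__main__":
    small_cluster = [Server(ram_gb=8, cores=4, disk_tb=2.0),
                     Server(ram_gb=6, cores=2, disk_tb=1.0)]
    print(meets_requirements(small_cluster))
```

A two-node cluster of modest machines passes the check, which is the article’s point about the low cost of entry.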
7. FICO Big Data Analyzer
Key differentiators: The FICO Decision Management Suite includes the FICO Big Data Analyzer, which provides an easy way for companies to use big data analytics for decision management solutions.
FICO’s Big Data Analyzer provides purpose-built analytics for business users, analysts, and data scientists from any type of data on Hadoop. Part of the appeal of FICO’s Big Data Analyzer is that it masks Hadoop’s complexity, allowing any user to gain more business value from any data.
FICO provides an end-to-end analytic modeling lifecycle solution for extracting and exploring data, creating predictive models, discovering business insights, and using this data to create actionable decisions.
8. Hadapt
Key differentiators: Hadapt was recently purchased by Teradata and has a patent-pending technology that features a hybrid architecture, which applies the latest relational database research to the Hadoop platform.
Hadapt 2.0 delivers interactive applications on Hadoop through Hadapt Interactive Query, the Hadapt Development Kit for custom analytics, and integration with Tableau software. Hadapt’s hybrid storage engine features two different approaches to storage for structured and unstructured data. Structured data uses a high-performance relational engine and unstructured data uses the Hadoop Distributed File System (HDFS). Hadapt has a lot of trademarked products as part of its Adaptive Analytical Platform plus its pending patent for its complete technology solution.
Hadapt diverges from the Hadoop crowd in that it uses a relational database for its analytics and integrates data without the need to ingest Hadoop data. The advantage is that you can have simultaneous operational and analytical processing on the same data sources. This greatly improves speed and efficiency in big data analysis and requires fewer steps.
9. Hortonworks
Key differentiators: 100 percent open source solution and a major contributor to the Apache Hadoop project.
Hortonworks is always on the leading edge of Hadoop and is committed to the use of open platforms for enterprise solutions surrounding big data and big data analytics. Forrester Research named Hortonworks as a technology leader and ecosystem builder for the entire Hadoop industry. Hortonworks has more volunteers involved in the Hadoop Project than any other commercial entity.
Hortonworks and a consortium of other companies developed Apache Atlas to meet the needs of working with metadata and data governance. “Atlas enhances governance capabilities in Hadoop for both prescriptive and forensic models enriched by taxonomical metadata.” Atlas is designed to exchange metadata with other tools and processes inside and outside of the Hadoop stack. It also enables platform-agnostic governance controls that address enterprise compliance requirements. Atlas currently holds incubator status in Apache’s project list.
10. IBM Open Platform
Key differentiators: If there’s a name that’s synonymous with big data, it’s IBM. The IBM Open Platform (IOP) uses a 100 percent open source solution and is 100 percent free.
IOP contains 16 different Apache projects and has full support for the Open Data Platform, which is a shared effort to promote and to advance Hadoop for the enterprise. IBM offers multiple tiers of its product. You can download the IOP free of charge or select a supported offering and use it on premises. You can use IBM’s Hadoop-as-a-Service on its SoftLayer cloud infrastructure to alleviate the pain of managing your own hardware and networking components. On top of the basic underlying infrastructure and software, IBM offers its BigInsights for Apache Hadoop product for your advanced analytical needs.
IBM’s BigInsights includes: Hadoop, SQL-on-Hadoop, business analytics tools, advanced analytics, accelerators, optimized performance, management, seamless data integration, and real-time streaming analytics.
11. MapR
Key differentiators: MapR is the only distribution that allows Hadoop to be accessed via the Network File System, or NFS. NFS allows faster data management and system administration without requiring multiple steps to move or to access data.
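The practical effect of NFS access is that ordinary file I/O works against cluster data. The sketch below assumes the cluster’s file system has already been mounted at a local path; the mount point `/mnt/cluster` and the file name are hypothetical, and `count_lines` is just an illustrative helper that uses nothing beyond the Python standard library.

```python
from pathlib import Path


def count_lines(mounted_file: Path) -> int:
    """Count lines using ordinary file I/O; no Hadoop client is involved."""
    with mounted_file.open("r", encoding="utf-8") as handle:
        return sum(1 for _ in handle)


# Hypothetical mount point; substitute wherever the cluster is mounted.
EXAMPLE_PATH = Path("/mnt/cluster/logs/app.log")

if __name__ == "__main__":
    if EXAMPLE_PATH.exists():
        print(count_lines(EXAMPLE_PATH))
    else:
        print(f"{EXAMPLE_PATH} is not mounted on this machine")
```

Because the file looks like any other local file, the same applies to standard tools and scripts, with no separate copy step into or out of the cluster.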
MapR provides a production-ready distribution that runs both online and analytical processing and applications on a single platform. This means that you can run more applications on one Hadoop cluster and minimize your operational costs.
MapR runs the world’s largest single production Hadoop clusters, with capabilities that include linear scalability beyond the 100-million-file limit of the Hadoop Distributed File System (HDFS); a distributed metadata architecture that scales to trillions of files and tables and is capable of storing and processing thousands of petabytes per cluster; and processing of files and tables in one distributed storage layer. This allows NoSQL and Hadoop applications to work seamlessly on a single platform.
MapR also provides Hadoop high availability everywhere across all Apache Hadoop projects and demonstrates 99.999 percent availability.
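To put “five nines” in perspective, availability percentages can be converted into allowed downtime. This is a minimal sketch assuming a 365-day year; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, assuming a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year a given availability permits."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(pct)
        print(f"{pct}% available -> {minutes:.2f} minutes of downtime per year")
```

At 99.999 percent, the budget works out to roughly 5.26 minutes of downtime in a full year.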
12. Pentaho
Key differentiators: Pentaho combines data integration with analytics and features a unique “in-Hadoop” execution that results in extremely fast performance.
Pentaho’s offering connects natively to Hadoop, to NoSQL, and to analytic databases, features a visual designer for MapReduce jobs, allows you to model and explore unstructured data sets, provides a multi-threaded data integration engine, and supports cluster nodes.
Pentaho’s solution also includes what it calls its adaptive big data layer that gives you the capability to access data once, process it, combine it, and consume it anywhere. It supports Hadoop distributions from Cloudera, Hortonworks, and MapR.
13. Pivotal
Key differentiators: Pivotal produces its own components for big data analytics that include Pivotal HD, Pivotal Greenplum Database, Pivotal GemFire, and Pivotal HAWQ.
Pivotal’s Hadoop distribution, Pivotal HD, is 100 percent Apache compliant, uses other Apache components, and is based on the Open Data Platform. Pivotal GemFire is a distributed data management platform designed for diverse data management situations, but is optimized for high-volume, latency-sensitive, mission-critical, transactional systems. The Pivotal Greenplum Database is a shared-nothing, massively parallel processing (MPP) database used for business intelligence processing as well as for advanced analytics. Pivotal’s HAWQ is an ANSI-compliant SQL dialect that supports application portability and the use of data visualization tools such as SAS and Tableau.
14. Supermicro
Key differentiators: Supermicro’s differentiator in this group is that it is a provider of the underlying commodity hardware that your Hadoop clusters run on.
Supermicro has partnered with Cloudera and Hortonworks to provide turnkey Hadoop cluster solutions to your business, should you decide to host your own Hadoop infrastructure. Your Hadoop implementation won’t even be worth its free price if your hardware is a bottleneck to your large data set processing.
Supermicro takes the guesswork and the opinions out of deciding which software works best with which hardware for the best performance and for the best price.
15. Zettaset
Key differentiators: Zettaset offers Hadoop cluster management software and encryption software for Hadoop data.
Zettaset tackles two very tough jobs for Hadoop: management and encryption. Encryption, while great for security, is usually not a great performer, but Zettaset’s BDEncrypt solution boasts high performance, standards-based encryption for your valuable Hadoop data.
Zettaset Orchestrator is a Hadoop management solution that enables you to address requirements for security, high availability, manageability, and scalability in a distributed computing environment. It includes encryption, role-based access controls, automation options, and interoperability with BI and analytics platforms, and it maintains your database availability and reliability.
Photo courtesy of Shutterstock.