Everyone knows that data volumes are growing exponentially. What’s not so clear is how to unlock the value all of that data holds. Enterprises are struggling to figure out how to store, manage and derive any real business value from Big Data.
Part of the problem is that traditional databases just aren’t suited for mining Big Data insights. Legacy systems were designed decades ago, long before Big Data was a trend.
Enter Apache Hadoop, an open-source framework that enables the processing of large data sets in a distributed environment. With Hadoop, applications can be run on systems composed of thousands of nodes with thousands of terabytes of data.
Gartner estimates the current Hadoop ecosystem market to be worth around $77 million. They expect that it will grow to $813 million by 2016. However, despite a few big-name backers, Hadoop is still relatively unproven in enterprise settings. Critics argue that while Hadoop works great as a processing platform, it’s not all that good with queries. The add-ons Hive and Pig both help with this, but Hadoop still isn’t quite a fully mature platform.
These startups intend to change that.
1. Alpine Data Labs
What they do: Provide data science solutions for Hadoop and Big Data
Headquarters: San Mateo, CA
CEO: Joe Otto, who previously ran Worldwide Sales for Greenplum, which is now part of EMC.
Founded: 2010
Funding: Alpine Data Labs is backed by a $7.5 million Series A round of funding from Sierra Ventures and Mission Ventures, along with EMC and Sumitomo Bank. The company is in the process of closing out a Series B round, which is expected to raise between $10 and $13 million.
Why they’re on this list: While there are a ton of Big Data tools entering the market, many companies still struggle to gain actionable insight from their mountains of data.
According to Alpine Data, part of the problem is that it’s much too difficult to get real insights out of Hadoop and other parallel platforms. Most companies don’t know what to do with massive datasets, and few have gotten any further with Hadoop than batch processing and basic querying.
Alpine Data set out to simplify machine-learning methods and make them available on petabyte-scale datasets. Their tools make these methods available in a lightweight web application with a code-free, drag-and-drop interface.
Alpine Data leverages the parallel processing power of Hadoop and MPP databases and implements data mining algorithms in MapReduce and SQL. Users interact with their data directly where it already sits and design analytics workflows without worrying about data movement or complex code. All this is done in a web browser, and Alpine Data then translates these visual workflows into a sequence of in-database or MapReduce tasks.
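For readers unfamiliar with the MapReduce model mentioned above, here is a minimal sketch of the pattern in plain Python. It is not Alpine Data's code and does not use the Hadoop API; the sample records and the function names (map_phase, shuffle, reduce_phase, mean_by_key) are hypothetical, chosen only to show how a simple statistic, a per-group average, splits into a map step and a reduce step.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sample records: (region, sale_amount)
records = [("west", 10.0), ("east", 4.0), ("west", 30.0), ("east", 6.0)]

def map_phase(record):
    """Emit (key, (partial_sum, partial_count)) for one input record."""
    region, amount = record
    return region, (amount, 1)

def shuffle(pairs):
    """Group mapped values by key, as a framework would between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(values):
    """Combine partial (sum, count) pairs for one key into a mean."""
    total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), values)
    return total / count

def mean_by_key(records):
    """Run map, shuffle and reduce in sequence on an in-memory list."""
    grouped = shuffle(map(map_phase, records))
    return {key: reduce_phase(values) for key, values in grouped.items()}

result = mean_by_key(records)
print(result)
```

On a real cluster the map and reduce steps would run in parallel across many nodes; here everything runs in a single process so the data flow is easy to follow.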
Alpine Data’s visual environment helps teams collaborate and quickly create and deploy analytics workflows and predictive models.
Customers include AT Kearney, Havas Digital, Zion Bank, Kaiser Permanente and CareCore.
Competitors: SAS dominates this market, but other startups are moving into this space too, including Platfora, Skytree, Revolution Analytics and Rapid-I.
2. Cloudera
What they do: Provide a Hadoop-based Big Data Platform
Headquarters: Palo Alto, CA
CEO: Mike Olson, who was formerly CEO of Sleepycat Software, an embedded database company that was acquired by Oracle in 2006. After the acquisition, Olson spent two years at Oracle as VP for Embedded Technologies.
Founded: 2008
Funding: Cloudera has raised $140 million in venture capital to date. Its investors include Accel Partners, Greylock Partners, Ignition Partners, In-Q-Tel and Meritech Capital Partners.
Why they’re on this list: Big Data is hot, and Cloudera is the pioneer that first developed a Hadoop-based platform for Big Data. Moreover, they’re sitting on a mountain of VC cash and have a solid management team.
Cloudera lets users query all of their structured and unstructured data and have a view beyond what’s available from relational databases. Cloudera recently released Impala, a new open-source interactive query engine for Hadoop that enables interactive querying on massive data sets in real time.
Customers include CBS Interactive, eBay, Expedia, Monsanto and Samsung.
Competitors: EMC Pivotal, Hortonworks, MapR. Intel recently joined the market as well, but it’s too early to tell how serious they are about this space.
3. Continuuity
What they do: Provide a Hadoop-based Big Data application hosting platform
Headquarters: Palo Alto, CA
CEO: Todd Papaioannou, who was previously an Entrepreneur in Residence at Battery Ventures.
Founded: 2011
Funding: $12.5 million from Battery Ventures, Ignition Partners, Andreessen Horowitz, Data Collective and Amplify Partners.
Why they’re on this list: Continuuity is banking on two big trends: cloud computing and Big Data. Delivered as a cloud platform, Continuuity’s Big Data application hosting platform, Continuuity AppFabric, helps existing application developers invent, deploy and manage Big Data applications.
The Continuuity AppFabric is built on top of existing Hadoop infrastructure components, but is intended to shield developers from the project’s complexity. When an app is ready, developers can push it to the cloud with a button click. They are then able to try it out on Continuuity’s Developer Sandbox and then push it to production.
Then, the AppFabric UI provides real-time information about applications, allowing developers to dynamically scale them by simply clicking a “+” button to increase capacity without taking the app offline or even having to think about infrastructure.
Since Big Data as a concept is all over the map, it’s a smart move to help developers create custom apps that target their specific Big Data goals.
Competitors: These aren’t all apples-to-apples direct competitors, but Continuuity will compete with AWS’s Elastic MapReduce, Infochimps, Mortar Data, Qubole, and Treasure Data.
4. MapR
What they do: Provide a Hadoop/NoSQL Big Data platform
Headquarters: San Jose, CA
CEO: John Schroeder. He was previously CEO of Calista Technologies, which was acquired by Microsoft, and also CEO of Rainfinity, which EMC purchased.
Founded: 2009
Funding: MapR just closed a $30 million round in March 2013, bringing its total funding to $59 million. The round was led by new investor Mayfield Fund and also includes existing investors Lightspeed Venture Partners, NEA and Redpoint Ventures.
Why they’re on this list: Hadoop and anything to do with Big Data are getting a lot of attention these days. Big names such as Yahoo and Facebook have both built applications on top of Hadoop. Meanwhile, Big Data promises to transform organizations as data analysts turn up all sorts of insights that were previously opaque.
MapR claims that it is able to merge Hadoop, NoSQL, database and streaming applications in one unified Big Data platform. Speed has been an issue with Hadoop, but MapR claims to have cleared this hurdle, while also offering such enterprise-grade features as “High Availability, business continuity, real-time streaming, standard file-based access through NFS, full database access through ODBC, and support for mission-critical SLAs.”
MapR’s current customers include Ancestry.com, the Rubicon Project, comScore and NextBio.
Competitors: MapR’s closest competitor is Cloudera. Others include EMC Pivotal and Hortonworks. Intel recently joined the market as well, although it remains to be seen how much of an actual threat they will present.
5. Mortar Data
What they do: Deliver the Hadoop platform as a service for building Big Data pipelines
Headquarters: New York, NY
CEO: K Young, who was previously an Architect at Wireless Generation, where he developed three flagship products (TPRI, DIBELS, ARIS).
Founded: 2011
Funding: $1.8 million from Genacast Ventures, Atlas Venture, Great Oaks Ventures, Chris Lynch (former CEO, Vertica), Matt Turck (Managing Director, FirstMark Capital), Richard Dale (co-founder, Phase Forward), TechStars, and other undisclosed private investors.
Why they’re on this list: Mortar Data bills itself as a company that can deliver “Hadoop in an hour.” Considering that the complexity of Hadoop can scare plenty of potential users away, this is solid positioning—assuming they live up to that promise. Mortar is also focused exclusively on engineers and data scientists, rather than analysts.
Mortar Data’s service is designed to facilitate team collaboration, allowing users to easily share, repeat and maintain their code. Data scientists and engineers using Mortar get full code history, full execution history, automated testing and one-button deployment.
Competitors: AWS’s Elastic MapReduce, Infochimps, Qubole, Continuuity, and Treasure Data.
6. Platfora
What they do: Develop software that transforms raw data in Hadoop into interactive, in-memory business intelligence
Headquarters: San Mateo, CA
CEO: Ben Werther, who was previously VP of Products at DataStax.
Founded: 2011
Funding: $27.2 million to date from Andreessen Horowitz, Battery Ventures, Sutter Hill Ventures and In-Q-Tel.
Why they’re on this list: They’re focused on the main challenge of Big Data, namely, how to make sense of it. They have a solid management team and an impressive amount of VC funding.
While businesses have been rapidly adopting Apache Hadoop as a scalable and inexpensive solution to store near-infinite amounts of data, they struggle to extract value from that data. Traditional relational database and analytics tools just can’t deal with massive amounts of structured and unstructured data. So businesses must perform a complex and rigid set of steps between the customer interactions that generate data and analyzing that data with business intelligence (BI) software. These steps include ETL (extract, transform, load) processing, building a data warehouse and connecting to a visualizer tool. Business users are required to be experts who understand MapReduce or SQL coding in order to access the data.
Platfora tries to simplify that process and automatically transform raw data in Hadoop into interactive, in-memory business intelligence, with no ETL or data warehousing required. Platfora provides an exploratory BI and analytics platform designed for business analysts and not just IT.
Edmunds.com is an early customer.
Competitors: SAS, Alpine Data, Skytree, Revolution Analytics and Rapid-I.
7. Splice Machine
What they do: Provide a Hadoop-based, SQL-compliant database designed for Big Data applications
Headquarters: San Francisco, CA
CEO: Monte Zweben. Zweben’s early career was spent with the NASA Ames Research Center as the Deputy Branch Chief of the Artificial Intelligence Branch. Later, he founded and served as CEO of Red Pepper Software and then Blue Martini Software.
Founded: October 2012
Funding: They are backed by $4 million in funding from Mohr Davidow Ventures.
Why they’re on this list: As Hadoop and NoSQL platforms catch on, users eventually run into a problem: limited SQL support forces them to rewrite existing apps or BI reports, which is a very costly process. Splice Machine argues that Big Data and application developer communities need a more cost-effective database to power their applications, one that combines the scalability and availability of NoSQL with the power and popularity of SQL.
Built on the Hadoop stack, the Splice SQL Engine enables application developers to build hyper-personalized web, mobile and social applications that can achieve Big Data scale, but the platform also allows users to leverage the ubiquity of SQL tools and skill sets in the marketplace.
Splice Machine contends that other “SQL on Hadoop” products (such as Apache Hive) are read-only, analytics-only solutions. Those analytics solutions cannot support real-time updates and ACID (Atomicity, Consistency, Isolation, Durability) transactions. Splice Machine is designed to support both operational (OLTP) and analytic (OLAP) workloads with real-time queries.
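To make the ACID point concrete, here is a minimal sketch of an atomic transaction. It uses Python's built-in sqlite3 module purely for illustration and says nothing about how Splice Machine is implemented; the accounts table, the names and the transfer function are all hypothetical. The idea is that a multi-statement update either applies completely or not at all.

```python
import sqlite3

# In-memory SQLite database standing in for any transactional SQL store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()

def transfer(conn, src, dst, amount):
    """Move funds atomically; return True on commit, False on rollback."""
    try:
        # The connection context manager commits if the block succeeds
        # and rolls back if an exception escapes it.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit is undone too.
        return False

ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # would overdraw, so rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

The second transfer credits one account before the debit fails, yet the rollback leaves both balances as they were after the first transfer. That all-or-nothing behavior is what operational (OLTP) applications typically depend on.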
Competitors: Apache Hive, Cloudera Impala, Apache Drill
Jeff Vance is a Santa Monica-based writer. He’s the founder of Startup50, a site devoted to emerging tech startups, and he also founded the content marketing firm, Sandstorm Media. Connect with him on Twitter @JWVance.