MapR Delivers on Big Data with Apache Hadoop


Simply stated, MapR Technologies delivers an enterprise-class distribution for Apache Hadoop with an in-Hadoop NoSQL (Not-only SQL) database. What that statement leaves out, though, is how critical this is to making big data and operational data both work not only at scale, but also with the performance, availability, security and other characteristics that give meaning to the term “enterprise-class.” Let’s consider why that is the case.

The Data Deluge Demands a New Architecture

The data deluge continues. IDC (International Data Corporation) defines three IT platforms over time. The mainframe (first) and client-server (second) are not going away, but the third platform (cloud, mobile, social media and big data) has to be taken fully into account. The first two platforms tend to be application-driven, whereas the third tends to be primarily data-driven.

As I have said before, application-driven means that the data is created and exists to meet the needs of the application (such as an online transaction processing system) so the pair is tightly coupled. Data-driven means that the data is created and exists to meet its own needs; yes, an application may support its creation (such as an e-mail), but the data has meaning and value apart from the application. Applications in a data-driven world are servants, not masters.

The mainframe is still the torch bearer for SQL-type structured databases. The client-server architecture has a home for a lot of structured data (that can be sorted in traditional relational databases), but also deals with a lot of semi-structured data (e-mails, word processing documents and the like that can be searched). The third platform is a real mixed bag of structured data, semi-structured data and true unstructured data (bit-mapped data that can be sensed, such as video, but neither sorted nor searched directly, except for attached metadata).

Enterprises create all these varieties of data in seemingly endless volumes, so it would be incredibly handy to be able to manage all types in a single database. However, that database cannot be a table-limited, SQL-type database (which, while very useful, was designed for a much more constrained data world). The answer is Hadoop.

Hadoop in its Native State Is Not Enough

Open source Hadoop is the new database architecture of choice. Its “shared nothing” approach (in which nodes do not share memory or disks, for example) has become very popular and is an exemplar of why the open source movement has been so productive. Alas, while Hadoop has proven its worth, it also has its limitations, such as not being inherently enterprise-grade or enterprise-class.

Now this may not matter for a big data predictive analysis that is beneficial but not critical: if the working copy of the data went away or security were breached, no revenue would be lost and no truly sensitive information would be exposed, so no harm would be done. But in enterprises where data analysis and tight security have real value (and whose employees could lose their jobs if something major goes wrong), enterprise-class is not a buzzword, but a necessary state.

Moreover, enterprises may value other characteristics, such as performance, that not all distributions of Hadoop are able to deliver. There are at least a couple of alternatives for solving the problem. One is to deliver a non-Hadoop database that has similar capabilities and is also enriched with enterprise features. This gives the developer total control over future product development, but gives up the leverage of no- or low-cost open source development. The second alternative is to build upon the Hadoop framework and add extensions that enrich the distribution while adhering to standard APIs.

That yields a proprietary solution that adds value but does not impose lock-in. MapR Technologies has elected to follow this second alternative.

What MapR Technologies Brings to the Table

Big data disrupts traditional IT thinking in many ways. One is in the overwhelming number of data types that have to be managed. Big data can include transactions (live credit card transactions and historical call detail records), streamed data (sensor/machine-based data, such as from the Internet of Things), interactional data (such as clickstream results), and observational data (such as customer sentiments). As has been said often, data has now taken its place as the fourth production factor beside the three in traditional economics: land, labor, and capital.

Traditional data warehouses still have a place in the world, but analyst Doug Laney (now of Gartner) had it right when he talked about the volume, variety and velocity of data; this leads to the need for a new data architecture. Hadoop represents that new world not only in being able to handle all data types, but also in requiring a schema only when the data is read. This is in stark contrast to the expensive and time-consuming process of designing and building a build-it-and-hope-they-will-come traditional data warehouse. Hadoop also closely couples processing with data in a scale-out, parallel-processing fashion.
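The idea of requiring a schema only when data is read (often called schema-on-read) can be illustrated with a small sketch. This is plain Python rather than Hadoop or MapR code, and the record contents, the PURCHASE_SCHEMA mapping and the read_with_schema helper are all hypothetical names made up for illustration. Raw records are stored exactly as they arrive, and each reader imposes only the structure it needs.

```python
import json

# Raw records are kept exactly as they arrived; nothing is validated at write time.
# (Illustrative data only.)
RAW_RECORDS = [
    '{"user": "alice", "amount": "19.99", "ts": "2014-03-01T10:00:00"}',
    '{"user": "bob", "clicks": 7}',
    '{"user": "carol", "amount": "5", "note": "gift"}',
]

# Hypothetical schema chosen by one particular reader:
# field name -> (type converter, default when the field is missing).
PURCHASE_SCHEMA = {
    "user": (str, ""),
    "amount": (float, 0.0),
}


def read_with_schema(raw_lines, schema):
    """Apply a schema while reading; fields outside the schema are ignored."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, (convert, default) in schema.items():
            row[field] = convert(record[field]) if field in record else default
        rows.append(row)
    return rows


purchases = read_with_schema(RAW_RECORDS, PURCHASE_SCHEMA)
total = sum(row["amount"] for row in purchases)
```

A second reader interested in clickstream data could apply a different schema to the same raw records without any reload or redesign, which is the contrast with a warehouse whose schema must be fixed before any data is loaded.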

A fundamental capability offered by MapR is that it tightly couples the analytical base that Hadoop delivers with the operational capabilities that traditional relational databases, such as those from Oracle, IBM, or Microsoft, provide. For many analyses, such as a churn analysis (a type of predictive analysis), historical information delivered in batch mode is sufficient. But in a world of mobile application servers and Web application servers, the ability to operate in real time with user data (such as user profiles and states, user interactions and real-time location data) is crucial.

In short, MapR provides a level of integration across data types that native Hadoop does not deliver. That includes support for mixed workloads, integrated search capabilities, and the ability to deploy and manage Hadoop and NoSQL as a single cluster. In addition, as part of its in-Hadoop database, MapR delivers enterprise-friendly capabilities for a distributed, unified namespace, data management and data protection (including disaster recovery) for all data files and tables.

As part of its enterprise-class capabilities, MapR claims zero downtime, with the high availability and disaster recovery that enterprises require, through no single point of failure, instant recovery upon node failure, and no downtime for regular maintenance. Security is key to claiming enterprise-class status, and MapR delivers wire-level authentication and encryption as well as fine-grained access control. And for database reliability, MapR provides the well-known ACID (atomicity, consistency, isolation and durability) guarantees for row-level transactions.
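To make the idea of a row-level transaction concrete, here is a toy sketch in plain Python. It is not MapR's implementation or API; the RowStore class and its methods are hypothetical and exist only for illustration. It shows atomicity (all column changes to a row are applied, or none are), consistency (a validation rule is checked before the write) and isolation (one writer per row at a time). Durability is not modeled, since everything lives in memory.

```python
import threading


class RowStore:
    """Toy in-memory table (hypothetical; not MapR's implementation or API)."""

    def __init__(self):
        self._rows = {}
        self._locks = {}
        self._guard = threading.Lock()  # protects the lock table itself

    def _lock_for(self, key):
        # Create or fetch the per-row lock.
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def update_row(self, key, changes, validate=None):
        """Apply all column changes to one row, or none of them."""
        with self._lock_for(key):  # isolation: one writer per row
            candidate = dict(self._rows.get(key, {}))
            candidate.update(changes)
            # Consistency: reject the whole update if the rule is violated.
            if validate is not None and not validate(candidate):
                raise ValueError("row failed validation; nothing was written")
            # Atomicity: the stored row is replaced in a single step.
            self._rows[key] = candidate

    def get_row(self, key):
        with self._lock_for(key):
            return dict(self._rows.get(key, {}))


store = RowStore()
store.update_row("acct-1", {"balance": 100, "status": "open"})

try:
    # This multi-column update breaks the rule, so neither column changes.
    store.update_row(
        "acct-1",
        {"balance": -50, "status": "overdrawn"},
        validate=lambda row: row["balance"] >= 0,
    )
    rejected = False
except ValueError:
    rejected = True

row_after = store.get_row("acct-1")
```

After the rejected update, the row still holds its original balance and status, which is the behavior the term "atomic" promises at the level of a single row.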

But all these capabilities (and many others as well) would be for naught if MapR didn’t also deliver high performance and scalability. MapR states that it delivers response times of less than 2 ms (milliseconds) with consistently low read latency, and claims four to ten times better throughput than other NoSQL databases (although competitors may beg to differ). MapR is scalable to 10,000 nodes, with the ability to handle millions of columns, trillions of rows per table and up to one trillion tables. Big data is big, but it is not infinite, and this should be enough for the vast majority of cases (probably all, but one has to allow for extreme cases).

Mesabi Musings

If data is indeed the fourth production factor in economics, then its impact is not only huge now, but will become ever more important as time goes by. And “big data” is the code word to think about with regard to that deluge of data and what it means. In addition, traditional data architectures are not able to meet the needs of volume (tsunami rather than fire hose is probably the best analogy), variety (the number of data types seems to be exploding), and velocity (mercurial). Hadoop was meant to address the situation, but while the direction and basic capabilities pursued by that open source community are on the right track, native Hadoop alone is not enough.

MapR believes that its in-Hadoop database distribution addresses the issues. With its integrative capabilities, MapR marries traditional batch analytics with real-time operational data. With its performance and scalability, it meets the needs of a wide range of use cases. And given its enterprise-class capabilities, enterprises should have no reason not to give MapR Technologies a good look for managing the big-and-getting-bigger data deluge.
