Hadoop has become nearly synonymous with Big Data. It’s the framework that distributes storage and processing across clusters of commodity machines in a way that makes Big Data possible. Okay, that’s a major oversimplification, but you get the general idea.
Hadoop’s power is represented by the top-flight Big Data startups using it, such as Cloudera, Hortonworks, and MapR, which all offer commercial distributions. Hadoop is at the core of important projects at major companies, such as those at Facebook, Yahoo!, and Amazon. In fact, Hadoop providers claim that more than half of the Fortune 50 is already using Hadoop.
Of course, Hadoop isn’t the only game in town. There are Big Data alternatives like Disco, Storm, and proprietary systems from Software AG, LexisNexis, ParStream, and others. Anytime a technology takes off the way Hadoop has, there will be kinks and pain points along the way, opening the door for even more innovation.
But Hadoop is getting the buzz, and many IT professionals wonder what all the fuss is about. Is Hadoop really that big of a deal?
The short answer is a resounding “yes.” Hadoop is a driving force in Big Data infrastructure, and many of the alternatives basically build on what Hadoop has already achieved, solving various headaches that you may or may not care about, depending on how you intend to use it.
The Big Data space may eventually evolve away from Hadoop, but, either way, no one can deny that Hadoop played a starring role in triggering the Big Data revolution.
Here are three examples where Hadoop’s impact could actually improve people’s lives:
The Climate Corporation leverages Hadoop to help farmers cope with climate change
Unless you’re a climate-change denier (and probably also think the moon landing was staged in Hollywood), it’s pretty obvious that farmers worldwide will need to adapt to climate change quickly. If you live in California, this fact is made even clearer by our record drought.
Climate Corporation is building out a system on MapR’s distribution of Hadoop that it hopes will better predict weather patterns in the coming years. The system generates weather projections for the next two years for every 2.5-by-2.5-kilometer grid cell across the U.S., mapping out the 10,000 most likely outcomes per location to create a probabilistic view of the weather.
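Climate Corp. hasn’t published the details of its pipeline, but the shape of the computation is familiar: emit simulated outcomes keyed by grid cell, then reduce each cell’s outcomes into a probability distribution. Here is a minimal sketch in the style of a Hadoop Streaming reducer; the record layout, grid-cell IDs, and choice of summary percentiles are assumptions for illustration, not the company’s actual code.

```python
# Illustrative sketch only: aggregating simulated weather outcomes per grid
# cell into a probabilistic summary, Hadoop Streaming reducer style.
import sys
from collections import defaultdict

def summarize(values):
    """Return a simple probabilistic summary (10th/50th/90th percentiles)."""
    values = sorted(values)
    def pct(p):
        return values[min(len(values) - 1, int(p * len(values)))]
    return pct(0.10), pct(0.50), pct(0.90)

def main():
    # Input lines are assumed to look like "<grid_cell_id>\t<simulated_rainfall_mm>",
    # one line per simulated outcome (e.g. 10,000 per cell).
    outcomes = defaultdict(list)
    for line in sys.stdin:
        cell, value = line.rstrip("\n").split("\t")
        outcomes[cell].append(float(value))
    for cell, values in outcomes.items():
        p10, p50, p90 = summarize(values)
        print(f"{cell}\t{p10:.1f}\t{p50:.1f}\t{p90:.1f}")

if __name__ == "__main__":
    main()
```

In a real Hadoop Streaming job the reducer input arrives grouped and sorted by key, so the per-cell aggregation could stream through rather than buffering everything in memory.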
I should note here that Climate Corp. was acquired by Monsanto in 2013 for more than $1 billion, so any green-washing critiques you want to make certainly have some legs, but that doesn’t minimize what Climate Corp. is doing.
“We are proud of using Hadoop to provide a class of weather insurance for farmers never before available and to do it in a way where, with index-based weather insurance, farmers have access to an independently sold product that changes how they manage risk. Since 85 percent of farmers’ risks are weather related, this is our impact on the world,” said Andy Mutz, director of engineering for Climate Corp.
Climate Corp. uses Hadoop to help simulate weather and to create risk portfolios that it sells to risk/underwriting partners. The goal is to help farmers understand the risks of their practices, reduce those risks by changing those practices, and underwrite weather insurance against adverse effects. Hadoop is central both to the weather simulation and to aggregating the financial data into the risk portfolios that Climate Corp. sells to partners.
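As a rough illustration of what that aggregation step involves (and not Climate Corp.’s actual model), a portfolio-level view might combine the per-location adverse-outcome probabilities from the simulation with each policy’s insured exposure. The field names, payout rule, and numbers below are hypothetical.

```python
# Rough illustration only: rolling per-location weather risk and insured
# exposure up into a portfolio-level expected payout.

# Per-location probability of an adverse outcome (e.g. drought), taken from
# a probabilistic weather simulation.
adverse_probability = {"cell_0412": 0.18, "cell_0413": 0.07, "cell_0519": 0.31}

# Policies: (location, insured liability in dollars).
policies = [
    ("cell_0412", 120_000),
    ("cell_0413", 80_000),
    ("cell_0519", 200_000),
    ("cell_0519", 50_000),
]

# Expected payout per policy, then for the portfolio as a whole.
expected_payouts = [
    (cell, liability * adverse_probability[cell]) for cell, liability in policies
]
portfolio_expected_payout = sum(payout for _, payout in expected_payouts)

print(f"portfolio expected payout: ${portfolio_expected_payout:,.0f}")
```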
The Durkheim Project combats suicide in the military
Suicide is an issue the U.S. military has struggled with for years. In 2012, a record 349 military suicides took place, far exceeding the number of American combat deaths in Afghanistan that same year. The rate of military suicides is roughly double that of adults in the general U.S. population.
To get better insight into the problem, predictive analytics firm Patterns and Predictions (P&P) created the Durkheim Project. Built on top of Cloudera’s distribution of Hadoop, the project uses an array of advanced analytics, real-time predictive modeling, and machine learning, all working in concert to identify critical correlations between veterans’ communications and suicide risk.
“One of the promises of Big Data in this case is that you can shorten the distance between the people who need help and the system that can get them help,” said P&P founder Chris Poulin.
Phase one concluded in early 2013, with project coordinators finding they could predict suicide risk among a veteran control group with 65 percent accuracy. Phase two launched in July 2013, with the goal of eventually enrolling 100,000 veterans in the study. Participants who opt in receive a unique Facebook app and a mobile app designed to capture posts, tweets, mobile uploads, and even location. Additional profile data is captured as well, including physician information and clinical notes.
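The Durkheim Project’s models aren’t public, but the general technique it describes (scoring free-text communications for risk with a trained classifier) can be sketched generically. The toy data, features, and model below are illustrative assumptions, not the project’s actual system, and in practice outputs like these serve only as signals for routing cases to clinicians.

```python
# Generic illustration only: a minimal text classifier of the kind used for
# risk scoring on written communications. Toy data and model choices are
# hypothetical, not the Durkheim Project's system.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: texts paired with a binary risk label (1 = flagged).
texts = [
    "feeling hopeful about the new job",
    "can't see a way forward anymore",
    "great day out with the family",
    "no one would even notice if i were gone",
]
labels = [0, 1, 0, 1]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(texts, labels)

# Score a new post; a real pipeline would run over data stored in Hadoop and
# surface only high-risk cases for human review.
print(model.predict_proba(["everything feels pointless lately"])[0][1])
```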
Eventually, the project hopes to save lives by enabling professionals to intervene before a suicide takes place. Still in its initial phases, though, the Durkheim Project is authorized only to monitor and analyze data. While the project has delivered statistically valid results that accurately predict suicide risk in a control group of veterans, its critical research is restricted, at least for the time being, to a non-interventional protocol.
UC Irvine Health improves clinical operations and scientific research with Hadoop
The Clinical Informatics Group (CIG) at UC Irvine Health (UCIH) was founded in 2009 to provide high-quality data to support the work done by researchers and clinicians at UC Irvine. However, as with many organizations, much of UCIH’s data was scattered across multiple Excel spreadsheets. UCIH also had 9 million semi-structured records for 1.2 million patients spanning 22 years, none of which was searchable or retrievable. These semi-structured records included dictated radiology reports, pathology reports, and rounding notes – all very valuable in aggregate, but not accessible in aggregate.
The CIG first migrated data to an enterprise data warehouse with integrated clinical business intelligence tools. Then, they migrated again to their current Big Data architecture, which is built on Hortonworks Data Platform (HDP).
The single Hadoop “data lake” at UCIH serves two different constituents: The UC Irvine School of Medicine for medical research and the UC Irvine Medical Center (UCIMC) for the quality of its clinical practice. The medical school and the hospital have distinct Big Data use cases, but they are both able to use a unified data platform with HDP at its core.
“Hadoop is the only technology that allows healthcare to store data in its native form. If Hadoop didn’t exist we would still have to make decisions about what can come into our data warehouse or the electronic medical record (and what cannot). Now, we can bring everything into Hadoop, regardless of data format or speed of ingest. If I find a new data source, I can start storing it the day that I learn about it. We leave no data behind,” said Charles Boicey, who was previously an informatics solutions architect with UCIMC. (Boicey recently accepted a new position with Stony Brook Medical.)
“Now back to those 9 million semi-structured legacy records. They are now searchable and retrievable in the Hadoop Distributed File System. This allowed the UCIH team to turn off their legacy system that was used for view only, saving them more than $500,000,” Boicey added.
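UCIH’s search tooling isn’t described in detail, but making free-text notes in HDFS “searchable” can be as simple as a distributed scan. A minimal Hadoop Streaming-style mapper gives the flavor; the record layout and search term below are assumptions, not UCIH’s actual implementation.

```python
# Minimal sketch, not UCIH's actual tooling: a Hadoop Streaming mapper that
# scans semi-structured clinical notes stored in HDFS for a keyword.
# Records are assumed to look like "<record_id>|<free text of the note>".
import sys

SEARCH_TERM = "pneumothorax"

for line in sys.stdin:
    record_id, _, text = line.rstrip("\n").partition("|")
    if SEARCH_TERM in text.lower():
        # Emit the matching record ID and a short snippet.
        print(f"{record_id}\t{text[:80]}")
```

Run with the standard Hadoop Streaming jar over the note directories, a script like this fans out across the cluster with no up-front schema required, which is the schema-on-read flexibility Boicey describes above.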
The CIG has already launched two new data-driven programs: a pilot that allows nurses to remotely monitor patient vitals in real time, and another that seeks to reduce patient re-admittance.
One of UCIH’s top goals is to predict the likelihood of hospital re-admittance within 30 days after discharge. Patients with congestive heart failure have a tendency to build up fluid, which causes them to gain weight. Rapid weight gain over a 1-2 day period is a sign that something is wrong and that the patient should see a doctor.
UCIH developed a program that sends those heart patients home with a scale and instructions to weigh themselves once daily. The weight data is wirelessly transmitted to Hadoop where an algorithm determines which weight changes indicate risk of re-admittance. The system notifies clinicians about only those cases. All home monitoring data will be viewable in the EMR via an API to Hadoop.
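UCIH hasn’t published the algorithm itself, but the rule described above (flag rapid weight gain within a one-to-two-day window) is straightforward to sketch. The 2-kilogram threshold and the data layout here are assumptions for illustration only.

```python
# Illustrative sketch only: flagging rapid weight gain over a 1-2 day window,
# as described above. Threshold and record format are assumptions.
from datetime import date

WEIGHT_GAIN_THRESHOLD_KG = 2.0
WINDOW_DAYS = 2

def flag_rapid_gain(readings):
    """readings: list of (date, weight_kg) tuples for one patient, any order.

    Returns True if weight rose by more than the threshold within the window.
    """
    readings = sorted(readings)
    for i, (d1, w1) in enumerate(readings):
        for d2, w2 in readings[i + 1:]:
            if (d2 - d1).days > WINDOW_DAYS:
                break
            if w2 - w1 > WEIGHT_GAIN_THRESHOLD_KG:
                return True
    return False

# Example: a congestive heart failure patient gaining 2.5 kg in two days.
readings = [
    (date(2014, 3, 1), 81.0),
    (date(2014, 3, 2), 82.3),
    (date(2014, 3, 3), 83.5),
]
print(flag_rapid_gain(readings))  # True -> notify the care team
```

In a deployment like the one described, a check of this kind would run as a recurring job over the home-monitoring data landing in Hadoop, with only flagged patients surfaced to clinicians.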