As datasets grow larger, Big Data experts face a growing problem: getting their data ready before it can actually be processed. A recent report found that nearly a third of business intelligence professionals spend anywhere from 50% to 90% of their time preparing data before they can even put it through analytics apps.
Xplenty, which sells a Big Data transformation platform, surveyed more than 200 BI professionals from across the U.S. on several areas of the ETL (Extract, Transform and Load) process. Almost all – 97% – said ETL was vital to business processes.
When asked what the biggest challenges were in making data "analytics-ready," 55% said integrating data from different platforms, followed by transforming, cleansing and formatting incoming data (39%), integrating relational and non-relational data (32%), and the sheer volume of data that needs to be managed at any given time (21%). Nearly a third of those polled (30%) said that they spend between 50% and 90% of their time on ETL alone.
This is hardly a new issue, said Yaniv Mor, CEO of Xplenty. "Things have been like that for the past 30 to 40 years," he said. "When data warehousing started, they understood it had to be ready for analytics, so it had to be stored differently. That's how the process of moving data from one place to another started."
In the early days of analytics, you typically had only a single database or source. Nowadays, data resides in many different places and has to be pulled from all of them, he said. That includes on-premises and cloud systems, as well as SQL and non-SQL databases.
So it's not that companies are sloppy with their data. With so many data sources, it's the nature of the beast that prep work ends up consuming most of the time. "To do analytics you have to do a lot of work. It's just the way it is. It became more and more challenging as time has gone by because there is so much more data and data formats have evolved and the locations of the data," he said.
He added: "The nasty thing about it is it's one of the most boring jobs as well. You have to conform all the ways people have entered a name and so on. It can be very monotonous."
Tim Crawford, who heads his own CIO consulting firm, AVOA, echoed this sentiment but added that the problem lies in the way we think about and manage data. "We come from a world where we're accustomed to structured data. We know where it comes from, [so we can] control the lineage," he said.
"Now the number of sources and volume has grown exponentially and the data is unstructured. The problem is we haven't as an industry figured out how to up-level the conversation around how we manage data. We need to think about how we manage data before we even touch it," added Crawford.
Holger Mueller, principal analyst and vice president with Constellation Research, said that the original problem of managing data was solved in the late 1990s by Business Objects, Cognos, and MicroStrategy. End users were able to build the reports and charts they wanted easily with client-side apps that didn't require programming or data science skills.
But that was transactional data coming from typical enterprise apps that used typical row-and-column data structures. "It was tough at the time but compared to today it was easy to work with," he said. Now users are struggling again because relevant data is no longer in traditional data stores and the old methods don’t work anymore.
"Someone who was a great data prepper in data warehousing can't use much of their skills today because they can't put it in a data warehouse anymore," he said.
"Hadoop allows people to store data in a cheap way and ask questions later," said Mueller. "That's why we're talking about data lakes. You're just packing it away until you figure out how to use it. Before it was, ‘Tell me what you want and I will build a report for you.’ Now you need different sources. More has to be done for the business user to get to his insight because there are more forms of data out there and no one vendor who controls them all."
For a while, as Mueller pointed out, the problem was solved with front-end tools that simplified the process of making reports. With the move to the cloud and Big Data, people are reverting to coding because there are no suitable tools to help them do those tasks in a more manageable way. But let's be honest, you can't have non-technical people writing code to sort through Big Data stores.
Crawford said the fix is not only technological but also a matter of how people approach the problem. Namely, it requires forethought up front on how to manage data before it gets into the system, even before the "cleaning" process.
"There is a need for new technology, however, the core problem will not be solved by technology, and that's a people problem. I still run into people who are thinking in ways to structure unstructured data. We need to think differently about data. How we leverage it and correlate it, not just the mechanics of how we store and maintain it," he said.
Of course, that's no magic bullet. Rather, it requires someone in a leadership capacity, with forethought, vision and an understanding of the business they operate in, to make this happen.
"First and foremost, do we understand how the business operates? [You] can start with data and work toward business or start with the business and work towards the data," said Crawford.
Because the solution has to be centered on the business problem the company is trying to solve, it is difficult for off-the-shelf technology to provide it. Solutions have to be customized to the company and therefore put together by the company. Just like DevOps.
"The concepts around DevOps are pretty generic," said Crawford. "How it gets implemented is very unique to that organization. You don’t just take it off the shelf and expect it to work. The same thing is true with regards to data. How it looks will absolutely be unique to that organization."
Mor agreed and said there needs to be more data governance by people and not total reliance on automation. Organizations need to enforce strict data rules, which are very complex and hard to automate. "Many organizations try to enforce data governance policies but very few succeed. It all comes down to so much data to be handled and stored, especially now that storage is so cheap and you can store it in the cloud," he said.
That said, Mor still thinks tools are necessary, but they aren't the be-all and end-all. "The promise of being fully automated is somewhat like the promise of the paperless office. There's a lot of analytical tools that claim they can hook into any data source and no ETL, they will bring you the analytics. It's nice to sell such a concept but it's almost never actually done," he said.
Automation works nicely with a single data source or simple data sources, but the minute you have different formats and the data sits in many locations, some manual data integration is required.
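To make that kind of integration work concrete, here is a minimal Python sketch. It is not drawn from Xplenty or any tool mentioned in the article; the sample data, field names and functions are all hypothetical. It assumes two sources, one CSV and one JSON, that describe the same customers with different field names and inconsistently entered names, and it conforms both to a common shape before combining them.

```python
import csv
import io
import json

# Hypothetical sample data: the same customers, stored in two formats
# with different field names and inconsistent name entry.
CSV_SOURCE = 'customer_name,total\n"SMITH, JOHN",120.50\n  jane   doe ,80\n'
JSON_SOURCE = '[{"name": "John Smith", "amount": "19.50"}, {"name": "JANE DOE", "amount": 20}]'


def conform_name(raw):
    """Normalize 'SMITH, JOHN' or '  jane   doe ' to 'John Smith' / 'Jane Doe'."""
    raw = raw.strip()
    if "," in raw:
        # Treat "Last, First" as a reversed name and flip it.
        last, first = raw.split(",", 1)
        raw = f"{first} {last}"
    # Collapse repeated whitespace and standardize capitalization.
    return " ".join(part.capitalize() for part in raw.split())


def load_csv(text):
    """Map CSV rows onto the common {'name', 'amount'} shape."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"name": conform_name(row["customer_name"]), "amount": float(row["total"])}


def load_json(text):
    """Map JSON records onto the same common shape."""
    for rec in json.loads(text):
        yield {"name": conform_name(rec["name"]), "amount": float(rec["amount"])}


def totals_by_customer(*sources):
    """Sum amounts per conformed customer name across all sources."""
    totals = {}
    for source in sources:
        for rec in source:
            totals[rec["name"]] = totals.get(rec["name"], 0.0) + rec["amount"]
    return totals


totals = totals_by_customer(load_csv(CSV_SOURCE), load_json(JSON_SOURCE))
print(totals)
```

Even in this toy case, each source needs its own hand-written mapping, and the name rule covers only two of the many ways a name might be entered. That is the manual effort the survey respondents describe.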
However, Mor sees demand for tools growing and expects it to keep growing rapidly. "More and more companies understand they have to be data-driven to remain competitive. Because of the challenges I have mentioned and so much data in so many locations, there will be more and more demand for ETL tools that would be very easy to use," he said.